Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
From: Steven Rostedt @ 2017-12-28  4:16 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Masami Hiramatsu, Alexei Starovoitov, Josef Bacik, mingo, davem,
	netdev, linux-kernel, ast, kernel-team, daniel, linux-btrfs,
	darrick.wong, Josef Bacik, Akinobu Mita
In-Reply-To: <03e0ebb7-0b2a-4235-3408-c0d59a1ba4c2@fb.com>

On Wed, 27 Dec 2017 19:45:42 -0800
Alexei Starovoitov <ast@fb.com> wrote:

> I don't think that's the case. My reading of current
> trace_kprobe_ftrace() -> arch_check_ftrace_location()
> is that it will not be true for old mcount case.

In the old mcount case, you can't use ftrace to return without calling
the function. That is, no modification of the return ip, unless you
created a trampoline that could handle arbitrary stack frames, and
remove them from the stack before returning back to the function.

> 
> As far as the rest of your arguments it very much puzzles me that
> you claim that this patch suppose to work based on historical
> reasoning whereas you did NOT test it.

I believe that Masami is saying that the modification of the IP from
kprobes has been very well tested. But I'm guessing that you still want
a test case for using kprobes in this particular instance. It's not the
implementation of modifying the IP that you are worried about, but the
implementation of BPF using it in this case. Right?

-- Steve

^ permalink raw reply

* Re: [PATCH v3 bpf-next 2/2] tools/bpftool: fix bpftool build with bintutils >= 2.9
From: Jakub Kicinski @ 2017-12-28  4:19 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: netdev, linux-kernel, kernel-team, Alexei Starovoitov,
	Daniel Borkmann
In-Reply-To: <20171227191629.4920-2-guro@fb.com>

On Wed, 27 Dec 2017 19:16:29 +0000, Roman Gushchin wrote:
> Bpftool build is broken with binutils version 2.29 and later.
> The cause is commit 003ca0fd2286 ("Refactor disassembler selection")
> in the binutils repo, which changed the disassembler() function
> signature.
> 
> Fix this by adding a new "feature" to the tools/build/features
> infrastructure and make it responsible for decision which
> disassembler() function signature to use.
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Daniel Borkmann <daniel@iogearbox.net>

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

^ permalink raw reply

* Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
From: Alexei Starovoitov @ 2017-12-28  4:32 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Masami Hiramatsu, Alexei Starovoitov, Josef Bacik, mingo, davem,
	netdev, linux-kernel, ast, kernel-team, daniel, linux-btrfs,
	darrick.wong, Josef Bacik, Akinobu Mita
In-Reply-To: <20171227231644.168abc0f@vmware.local.home>

On 12/27/17 8:16 PM, Steven Rostedt wrote:
> On Wed, 27 Dec 2017 19:45:42 -0800
> Alexei Starovoitov <ast@fb.com> wrote:
>
>> I don't think that's the case. My reading of current
>> trace_kprobe_ftrace() -> arch_check_ftrace_location()
>> is that it will not be true for old mcount case.
>
> In the old mcount case, you can't use ftrace to return without calling
> the function. That is, no modification of the return ip, unless you
> created a trampoline that could handle arbitrary stack frames, and
> remove them from the stack before returning back to the function.

correct. I was saying that trace_kprobe_ftrace() won't let us do
bpf_override_return with old mcount.

>>
>> As far as the rest of your arguments it very much puzzles me that
>> you claim that this patch suppose to work based on historical
>> reasoning whereas you did NOT test it.
>
> I believe that Masami is saying that the modification of the IP from
> kprobes has been very well tested. But I'm guessing that you still want
> a test case for using kprobes in this particular instance. It's not the
> implementation of modifying the IP that you are worried about, but the
> implementation of BPF using it in this case. Right?

exactly. No doubt that old code works.
But it doesn't mean that bpf_override_return() will continue to
work in kprobes that are not ftrace based.
I suspect Josef's existing test case will cover this situation.
Probably only special .config is needed to disable ftrace, so
"kprobe on entry but not ftrace" check will kick in.
But I didn't get an impression that this situation was tested.
Instead I see only logical reasoning that it's _supposed_ to work.
That's not enough.

^ permalink raw reply

* Re: [PATCH iproute2 0/3] ip/tunnel: Fix noencap- and document "external" parameter handling.
From: Serhey Popovych @ 2017-12-28  5:31 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, u9012063
In-Reply-To: <20171227134712.12338c7c@xeon-e3>

[-- Attachment #1.1: Type: text/plain, Size: 1987 bytes --]

Stephen Hemminger wrote:
> On Wed, 27 Dec 2017 13:28:13 +0200
> Serhey Popovych <serhe.popovych@gmail.com> wrote:
> 
>> In this series I present next set of improvements/fixes:
>>
>>   1) Fix noencap- option handling: we need to clear bit, instead
>>      of seting all, but one we expect to clear.
>>
>>   2) Document "external" parameter both in ip-link(8) and help
>>      output for link_gre.c. Add "noexternal" option variant to
>>      bring inline with GENEVE and VXLAN.
>>
>>   3) Trivial: clear flowlabel/tclass from flowinfo in case of
>>      inherit to stop sending garbage to the kernel. It has no
>>      functional change, but follows similar behaviour in link_ip6tnl.c
>>
>> See individual patch description message for details.
>>
>> Thanks,
>> Serhii
>>
>> Serhey Popovych (3):
>>   gre,ip6tnl/tunnel: Fix noencap- support
>>   gre6/tunnel: Do not submit garbage in flowinfo
>>   ip/tunnel: Document "external" parameter
>>
>>  ip/link_gre.c         |    7 +++++--
>>  ip/link_gre6.c        |    4 ++--
>>  ip/link_ip6tnl.c      |    6 ++++--
>>  ip/link_iptnl.c       |    4 +++-
>>  man/man8/ip-link.8.in |    6 ++++++
>>  5 files changed, 20 insertions(+), 7 deletions(-)
>>
> 
> These are really disjoint. I applied the noencap and the flowinfo patch.
> Agree with William that having noexternal which does nothing useful is of little value now.

Agree with you guys, just added it to match GENEVE/VXLAN implementation
which is has "noexternal".

Will send v2 to just document "external" and clear flowlabel/tclass in
flowinfo in case of inherit.

I have plans for iproute2-next to merge link_{gre,vti}{,6}.c and
link_ip{6,}tnl.c functionality together and just want to do things
as clear as possible before doing merge.

This is to reduce size of ip(8) binary to help openwrt like solutions to
survive JSON print support that increases size of binary.

Thank you for quick review and good feed back!

> 

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply

* [PATCH net-next] cxgb4: display VNI correctly
From: Ganesh Goudar @ 2017-12-28  5:59 UTC (permalink / raw)
  To: netdev, davem
  Cc: nirranjan, indranil, venkatesh, Ganesh Goudar, Santosh Rastapur

Fix incorrect VNI display in mps_tcam

Signed-off-by: Santosh Rastapur <santosh@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
index d3ced04..4ea76c1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_debugfs.c
@@ -1743,7 +1743,7 @@ static int mps_tcam_show(struct seq_file *seq, void *v)
 			 */
 			if (lookup_type && (lookup_type != DATALKPTYPE_M)) {
 				/* Inner header VNI */
-				vniy = ((data2 & DATAVIDH2_F) << 23) |
+				vniy = (data2 & DATAVIDH2_F) |
 				       (DATAVIDH1_G(data2) << 16) | VIDL_G(val);
 				dip_hit = data2 & DATADIPHIT_F;
 			} else {
@@ -1753,6 +1753,7 @@ static int mps_tcam_show(struct seq_file *seq, void *v)
 			port_num = DATAPORTNUM_G(data2);
 
 			/* Read tcamx. Change the control param */
+			vnix = 0;
 			ctl |= CTLXYBITSEL_V(1);
 			t4_write_reg(adap, MPS_CLS_TCAM_DATA2_CTL_A, ctl);
 			val = t4_read_reg(adap, MPS_CLS_TCAM_DATA1_A);
@@ -1761,7 +1762,7 @@ static int mps_tcam_show(struct seq_file *seq, void *v)
 			data2 = t4_read_reg(adap, MPS_CLS_TCAM_DATA2_CTL_A);
 			if (lookup_type && (lookup_type != DATALKPTYPE_M)) {
 				/* Inner header VNI mask */
-				vnix = ((data2 & DATAVIDH2_F) << 23) |
+				vnix = (data2 & DATAVIDH2_F) |
 				       (DATAVIDH1_G(data2) << 16) | VIDL_G(val);
 			}
 		} else {
@@ -1834,7 +1835,8 @@ static int mps_tcam_show(struct seq_file *seq, void *v)
 					   addr[1], addr[2], addr[3],
 					   addr[4], addr[5],
 					   (unsigned long long)mask,
-					   vniy, vnix, dip_hit ? 'Y' : 'N',
+					   vniy, (vnix | vniy),
+					   dip_hit ? 'Y' : 'N',
 					   port_num,
 					   (cls_lo & T6_SRAM_VLD_F) ? 'Y' : 'N',
 					   PORTMAP_G(cls_hi),
-- 
2.1.0

^ permalink raw reply related

* [PATCH net-next v6 0/6] net: tcp: sctp: dccp: Replace jprobe usage with trace events
From: Masami Hiramatsu @ 2017-12-28  6:10 UTC (permalink / raw)
  To: Ingo Molnar, David S . Miller, Ian McDonald, Vlad Yasevich,
	Stephen Hemminger, Steven Rostedt
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, H . Peter Anvin,
	Gerrit Renker, Neil Horman, dccp, netdev, linux-sctp,
	Stephen Rothwell, mhiramat

Hi,

This series is v6 of the replacement of jprobe usage with trace
events. This version fixes trace/events/tcp.h to avoid sparse
warning, as same as Mat Martineau's patch.

Previous version is here;
 https://www.spinics.net/lists/netdev/msg474057.html

Changes from v5:
  [1/6]: Avoid preprocessor directives in tracepoint macro args

Thank you,

---

Masami Hiramatsu (6):
      net: tcp: Add trace events for TCP congestion window tracing
      net: tcp: Remove TCP probe module
      net: sctp: Add SCTP ACK tracking trace event
      net: sctp: Remove debug SCTP probe module
      net: dccp: Add DCCP sendmsg trace event
      net: dccp: Remove dccpprobe module


 include/trace/events/sctp.h |   99 ++++++++++++++
 include/trace/events/tcp.h  |   97 ++++++++++++++
 net/Kconfig                 |   17 --
 net/dccp/Kconfig            |   17 --
 net/dccp/Makefile           |    5 -
 net/dccp/probe.c            |  203 -----------------------------
 net/dccp/proto.c            |    5 +
 net/dccp/trace.h            |  105 +++++++++++++++
 net/ipv4/Makefile           |    1 
 net/ipv4/tcp_input.c        |    3 
 net/ipv4/tcp_probe.c        |  301 -------------------------------------------
 net/sctp/Kconfig            |   12 --
 net/sctp/Makefile           |    3 
 net/sctp/probe.c            |  244 -----------------------------------
 net/sctp/sm_statefuns.c     |    5 +
 15 files changed, 317 insertions(+), 800 deletions(-)
 create mode 100644 include/trace/events/sctp.h
 delete mode 100644 net/dccp/probe.c
 create mode 100644 net/dccp/trace.h
 delete mode 100644 net/ipv4/tcp_probe.c
 delete mode 100644 net/sctp/probe.c

--
Masami Hiramatsu (Linaro) <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH net-next v6 1/6] net: tcp: Add trace events for TCP congestion window tracing
From: Masami Hiramatsu @ 2017-12-28  6:10 UTC (permalink / raw)
  To: Ingo Molnar, David S . Miller, Ian McDonald, Vlad Yasevich,
	Stephen Hemminger, Steven Rostedt
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, H . Peter Anvin,
	Gerrit Renker, Neil Horman, dccp, netdev, linux-sctp,
	Stephen Rothwell, mhiramat
In-Reply-To: <151444140014.11331.17287122201510506083.stgit@devbox>

This adds an event to trace TCP stat variables with
slightly intrusive trace-event. This uses ftrace/perf
event log buffer to trace those state, no needs to
prepare own ring-buffer, nor custom user apps.

User can use ftrace to trace this event as below;

  # cd /sys/kernel/debug/tracing
  # echo 1 > events/tcp/tcp_probe/enable
  (run workloads)
  # cat trace

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
 Changes in v6:
  - Avoid preprocessor directives in tracepoint macro args as
    Mat did on net tree.
---
 include/trace/events/tcp.h |   97 ++++++++++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_input.c       |    3 +
 2 files changed, 100 insertions(+)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 8e88a1671538..4dea6342f7d4 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -1,3 +1,4 @@
+/* SPDX-License-Identifier: GPL-2.0 */
 #undef TRACE_SYSTEM
 #define TRACE_SYSTEM tcp
 
@@ -8,6 +9,7 @@
 #include <linux/tcp.h>
 #include <linux/tracepoint.h>
 #include <net/ipv6.h>
+#include <net/tcp.h>
 
 /*
  * tcp event with arguments sk and skb
@@ -277,6 +279,101 @@ TRACE_EVENT(tcp_retransmit_synack,
 		  __entry->saddr_v6, __entry->daddr_v6)
 );
 
+
+#define TP_STORE_ADDR_PORTS_V4(__entry, inet, sk)			\
+	do {								\
+		struct sockaddr_in *v4 = (void *)__entry->saddr;	\
+									\
+		v4->sin_family = AF_INET;				\
+		v4->sin_port = inet->inet_sport;			\
+		v4->sin_addr.s_addr = inet->inet_saddr;			\
+		v4 = (void *)__entry->daddr;				\
+		v4->sin_family = AF_INET;				\
+		v4->sin_port = inet->inet_dport;			\
+		v4->sin_addr.s_addr = inet->inet_daddr;			\
+	} while (0)
+
+#if IS_ENABLED(CONFIG_IPV6)
+
+#define TP_STORE_ADDR_PORTS(__entry, inet, sk)				\
+	do {								\
+		if (sk->sk_family == AF_INET6) {			\
+			struct sockaddr_in6 *v6 = (void *)__entry->saddr; \
+									\
+			v6->sin6_family = AF_INET6;			\
+			v6->sin6_port = inet->inet_sport;		\
+			v6->sin6_addr = inet6_sk(sk)->saddr;		\
+			v6 = (void *)__entry->daddr;			\
+			v6->sin6_family = AF_INET6;			\
+			v6->sin6_port = inet->inet_dport;		\
+			v6->sin6_addr = sk->sk_v6_daddr;		\
+		} else							\
+			TP_STORE_ADDR_PORTS_V4(__entry, inet, sk);	\
+	} while (0)
+
+#else
+
+#define TP_STORE_ADDR_PORTS(__entry, inet, sk)		\
+	TP_STORE_ADDR_PORTS_V4(__entry, inet, sk);
+
+#endif
+
+TRACE_EVENT(tcp_probe,
+
+	TP_PROTO(struct sock *sk, struct sk_buff *skb),
+
+	TP_ARGS(sk, skb),
+
+	TP_STRUCT__entry(
+		/* sockaddr_in6 is always bigger than sockaddr_in */
+		__array(__u8, saddr, sizeof(struct sockaddr_in6))
+		__array(__u8, daddr, sizeof(struct sockaddr_in6))
+		__field(__u16, sport)
+		__field(__u16, dport)
+		__field(__u32, mark)
+		__field(__u16, length)
+		__field(__u32, snd_nxt)
+		__field(__u32, snd_una)
+		__field(__u32, snd_cwnd)
+		__field(__u32, ssthresh)
+		__field(__u32, snd_wnd)
+		__field(__u32, srtt)
+		__field(__u32, rcv_wnd)
+	),
+
+	TP_fast_assign(
+		const struct tcp_sock *tp = tcp_sk(sk);
+		const struct inet_sock *inet = inet_sk(sk);
+
+		memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+		memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
+
+		TP_STORE_ADDR_PORTS(__entry, inet, sk);
+
+		/* For filtering use */
+		__entry->sport = ntohs(inet->inet_sport);
+		__entry->dport = ntohs(inet->inet_dport);
+		__entry->mark = skb->mark;
+
+		__entry->length = skb->len;
+		__entry->snd_nxt = tp->snd_nxt;
+		__entry->snd_una = tp->snd_una;
+		__entry->snd_cwnd = tp->snd_cwnd;
+		__entry->snd_wnd = tp->snd_wnd;
+		__entry->rcv_wnd = tp->rcv_wnd;
+		__entry->ssthresh = tcp_current_ssthresh(sk);
+		__entry->srtt = tp->srtt_us >> 3;
+	),
+
+	TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x "
+		  "snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u "
+		  "rcv_wnd=%u",
+		  __entry->saddr, __entry->daddr, __entry->mark,
+		  __entry->length, __entry->snd_nxt, __entry->snd_una,
+		  __entry->snd_cwnd, __entry->ssthresh, __entry->snd_wnd,
+		  __entry->srtt, __entry->rcv_wnd)
+);
+
 #endif /* _TRACE_TCP_H */
 
 /* This part must be outside protection */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4d55c4b338ee..ff71b18d9682 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5299,6 +5299,9 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 	unsigned int len = skb->len;
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	/* TCP congestion window tracking */
+	trace_tcp_probe(sk, skb);
+
 	tcp_mstamp_refresh(tp);
 	if (unlikely(!sk->sk_rx_dst))
 		inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);

^ permalink raw reply related

* [PATCH net-next v6 2/6] net: tcp: Remove TCP probe module
From: Masami Hiramatsu @ 2017-12-28  6:11 UTC (permalink / raw)
  To: Ingo Molnar, David S . Miller, Ian McDonald, Vlad Yasevich,
	Stephen Hemminger, Steven Rostedt
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, H . Peter Anvin,
	Gerrit Renker, Neil Horman, dccp, netdev, linux-sctp,
	Stephen Rothwell, mhiramat
In-Reply-To: <151444140014.11331.17287122201510506083.stgit@devbox>

Remove TCP probe module since jprobe has been deprecated.
That function is now replaced by tcp/tcp_probe trace-event.
You can use it via ftrace or perftools.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
 net/Kconfig          |   17 ---
 net/ipv4/Makefile    |    1 
 net/ipv4/tcp_probe.c |  301 --------------------------------------------------
 3 files changed, 319 deletions(-)
 delete mode 100644 net/ipv4/tcp_probe.c

diff --git a/net/Kconfig b/net/Kconfig
index 9dba2715919d..efe930db3c08 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -336,23 +336,6 @@ config NET_PKTGEN
 	  To compile this code as a module, choose M here: the
 	  module will be called pktgen.
 
-config NET_TCPPROBE
-	tristate "TCP connection probing"
-	depends on INET && PROC_FS && KPROBES
-	---help---
-	This module allows for capturing the changes to TCP connection
-	state in response to incoming packets. It is used for debugging
-	TCP congestion avoidance modules. If you don't understand
-	what was just said, you don't need it: say N.
-
-	Documentation on how to use TCP connection probing can be found
-	at:
-	
-	  http://www.linuxfoundation.org/collaborate/workgroups/networking/tcpprobe
-
-	To compile this code as a module, choose M here: the
-	module will be called tcp_probe.
-
 config NET_DROP_MONITOR
 	tristate "Network packet drop alerting service"
 	depends on INET && TRACEPOINTS
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index c6c8ad1d4b6d..47a0a6649a9d 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -43,7 +43,6 @@ obj-$(CONFIG_INET_DIAG) += inet_diag.o
 obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
 obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
 obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o
-obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
 obj-$(CONFIG_TCP_CONG_BBR) += tcp_bbr.o
 obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
 obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
deleted file mode 100644
index 697f4c67b2e3..000000000000
--- a/net/ipv4/tcp_probe.c
+++ /dev/null
@@ -1,301 +0,0 @@
-/*
- * tcpprobe - Observe the TCP flow with kprobes.
- *
- * The idea for this came from Werner Almesberger's umlsim
- * Copyright (C) 2004, Stephen Hemminger <shemminger@osdl.org>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include <linux/kernel.h>
-#include <linux/kprobes.h>
-#include <linux/socket.h>
-#include <linux/tcp.h>
-#include <linux/slab.h>
-#include <linux/proc_fs.h>
-#include <linux/module.h>
-#include <linux/ktime.h>
-#include <linux/time.h>
-#include <net/net_namespace.h>
-
-#include <net/tcp.h>
-
-MODULE_AUTHOR("Stephen Hemminger <shemminger@linux-foundation.org>");
-MODULE_DESCRIPTION("TCP cwnd snooper");
-MODULE_LICENSE("GPL");
-MODULE_VERSION("1.1");
-
-static int port __read_mostly;
-MODULE_PARM_DESC(port, "Port to match (0=all)");
-module_param(port, int, 0);
-
-static unsigned int bufsize __read_mostly = 4096;
-MODULE_PARM_DESC(bufsize, "Log buffer size in packets (4096)");
-module_param(bufsize, uint, 0);
-
-static unsigned int fwmark __read_mostly;
-MODULE_PARM_DESC(fwmark, "skb mark to match (0=no mark)");
-module_param(fwmark, uint, 0);
-
-static int full __read_mostly;
-MODULE_PARM_DESC(full, "Full log (1=every ack packet received,  0=only cwnd changes)");
-module_param(full, int, 0);
-
-static const char procname[] = "tcpprobe";
-
-struct tcp_log {
-	ktime_t tstamp;
-	union {
-		struct sockaddr		raw;
-		struct sockaddr_in	v4;
-		struct sockaddr_in6	v6;
-	}	src, dst;
-	u16	length;
-	u32	snd_nxt;
-	u32	snd_una;
-	u32	snd_wnd;
-	u32	rcv_wnd;
-	u32	snd_cwnd;
-	u32	ssthresh;
-	u32	srtt;
-};
-
-static struct {
-	spinlock_t	lock;
-	wait_queue_head_t wait;
-	ktime_t		start;
-	u32		lastcwnd;
-
-	unsigned long	head, tail;
-	struct tcp_log	*log;
-} tcp_probe;
-
-static inline int tcp_probe_used(void)
-{
-	return (tcp_probe.head - tcp_probe.tail) & (bufsize - 1);
-}
-
-static inline int tcp_probe_avail(void)
-{
-	return bufsize - tcp_probe_used() - 1;
-}
-
-#define tcp_probe_copy_fl_to_si4(inet, si4, mem)		\
-	do {							\
-		si4.sin_family = AF_INET;			\
-		si4.sin_port = inet->inet_##mem##port;		\
-		si4.sin_addr.s_addr = inet->inet_##mem##addr;	\
-	} while (0)						\
-
-/*
- * Hook inserted to be called before each receive packet.
- * Note: arguments must match tcp_rcv_established()!
- */
-static void jtcp_rcv_established(struct sock *sk, struct sk_buff *skb,
-				 const struct tcphdr *th)
-{
-	unsigned int len = skb->len;
-	const struct tcp_sock *tp = tcp_sk(sk);
-	const struct inet_sock *inet = inet_sk(sk);
-
-	/* Only update if port or skb mark matches */
-	if (((port == 0 && fwmark == 0) ||
-	     ntohs(inet->inet_dport) == port ||
-	     ntohs(inet->inet_sport) == port ||
-	     (fwmark > 0 && skb->mark == fwmark)) &&
-	    (full || tp->snd_cwnd != tcp_probe.lastcwnd)) {
-
-		spin_lock(&tcp_probe.lock);
-		/* If log fills, just silently drop */
-		if (tcp_probe_avail() > 1) {
-			struct tcp_log *p = tcp_probe.log + tcp_probe.head;
-
-			p->tstamp = ktime_get();
-			switch (sk->sk_family) {
-			case AF_INET:
-				tcp_probe_copy_fl_to_si4(inet, p->src.v4, s);
-				tcp_probe_copy_fl_to_si4(inet, p->dst.v4, d);
-				break;
-			case AF_INET6:
-				memset(&p->src.v6, 0, sizeof(p->src.v6));
-				memset(&p->dst.v6, 0, sizeof(p->dst.v6));
-#if IS_ENABLED(CONFIG_IPV6)
-				p->src.v6.sin6_family = AF_INET6;
-				p->src.v6.sin6_port = inet->inet_sport;
-				p->src.v6.sin6_addr = inet6_sk(sk)->saddr;
-
-				p->dst.v6.sin6_family = AF_INET6;
-				p->dst.v6.sin6_port = inet->inet_dport;
-				p->dst.v6.sin6_addr = sk->sk_v6_daddr;
-#endif
-				break;
-			default:
-				BUG();
-			}
-
-			p->length = len;
-			p->snd_nxt = tp->snd_nxt;
-			p->snd_una = tp->snd_una;
-			p->snd_cwnd = tp->snd_cwnd;
-			p->snd_wnd = tp->snd_wnd;
-			p->rcv_wnd = tp->rcv_wnd;
-			p->ssthresh = tcp_current_ssthresh(sk);
-			p->srtt = tp->srtt_us >> 3;
-
-			tcp_probe.head = (tcp_probe.head + 1) & (bufsize - 1);
-		}
-		tcp_probe.lastcwnd = tp->snd_cwnd;
-		spin_unlock(&tcp_probe.lock);
-
-		wake_up(&tcp_probe.wait);
-	}
-
-	jprobe_return();
-}
-
-static struct jprobe tcp_jprobe = {
-	.kp = {
-		.symbol_name	= "tcp_rcv_established",
-	},
-	.entry	= jtcp_rcv_established,
-};
-
-static int tcpprobe_open(struct inode *inode, struct file *file)
-{
-	/* Reset (empty) log */
-	spin_lock_bh(&tcp_probe.lock);
-	tcp_probe.head = tcp_probe.tail = 0;
-	tcp_probe.start = ktime_get();
-	spin_unlock_bh(&tcp_probe.lock);
-
-	return 0;
-}
-
-static int tcpprobe_sprint(char *tbuf, int n)
-{
-	const struct tcp_log *p
-		= tcp_probe.log + tcp_probe.tail;
-	struct timespec64 ts
-		= ktime_to_timespec64(ktime_sub(p->tstamp, tcp_probe.start));
-
-	return scnprintf(tbuf, n,
-			"%lu.%09lu %pISpc %pISpc %d %#x %#x %u %u %u %u %u\n",
-			(unsigned long)ts.tv_sec,
-			(unsigned long)ts.tv_nsec,
-			&p->src, &p->dst, p->length, p->snd_nxt, p->snd_una,
-			p->snd_cwnd, p->ssthresh, p->snd_wnd, p->srtt, p->rcv_wnd);
-}
-
-static ssize_t tcpprobe_read(struct file *file, char __user *buf,
-			     size_t len, loff_t *ppos)
-{
-	int error = 0;
-	size_t cnt = 0;
-
-	if (!buf)
-		return -EINVAL;
-
-	while (cnt < len) {
-		char tbuf[256];
-		int width;
-
-		/* Wait for data in buffer */
-		error = wait_event_interruptible(tcp_probe.wait,
-						 tcp_probe_used() > 0);
-		if (error)
-			break;
-
-		spin_lock_bh(&tcp_probe.lock);
-		if (tcp_probe.head == tcp_probe.tail) {
-			/* multiple readers race? */
-			spin_unlock_bh(&tcp_probe.lock);
-			continue;
-		}
-
-		width = tcpprobe_sprint(tbuf, sizeof(tbuf));
-
-		if (cnt + width < len)
-			tcp_probe.tail = (tcp_probe.tail + 1) & (bufsize - 1);
-
-		spin_unlock_bh(&tcp_probe.lock);
-
-		/* if record greater than space available
-		   return partial buffer (so far) */
-		if (cnt + width >= len)
-			break;
-
-		if (copy_to_user(buf + cnt, tbuf, width))
-			return -EFAULT;
-		cnt += width;
-	}
-
-	return cnt == 0 ? error : cnt;
-}
-
-static const struct file_operations tcpprobe_fops = {
-	.owner	 = THIS_MODULE,
-	.open	 = tcpprobe_open,
-	.read    = tcpprobe_read,
-	.llseek  = noop_llseek,
-};
-
-static __init int tcpprobe_init(void)
-{
-	int ret = -ENOMEM;
-
-	/* Warning: if the function signature of tcp_rcv_established,
-	 * has been changed, you also have to change the signature of
-	 * jtcp_rcv_established, otherwise you end up right here!
-	 */
-	BUILD_BUG_ON(__same_type(tcp_rcv_established,
-				 jtcp_rcv_established) == 0);
-
-	init_waitqueue_head(&tcp_probe.wait);
-	spin_lock_init(&tcp_probe.lock);
-
-	if (bufsize == 0)
-		return -EINVAL;
-
-	bufsize = roundup_pow_of_two(bufsize);
-	tcp_probe.log = kcalloc(bufsize, sizeof(struct tcp_log), GFP_KERNEL);
-	if (!tcp_probe.log)
-		goto err0;
-
-	if (!proc_create(procname, S_IRUSR, init_net.proc_net, &tcpprobe_fops))
-		goto err0;
-
-	ret = register_jprobe(&tcp_jprobe);
-	if (ret)
-		goto err1;
-
-	pr_info("probe registered (port=%d/fwmark=%u) bufsize=%u\n",
-		port, fwmark, bufsize);
-	return 0;
- err1:
-	remove_proc_entry(procname, init_net.proc_net);
- err0:
-	kfree(tcp_probe.log);
-	return ret;
-}
-module_init(tcpprobe_init);
-
-static __exit void tcpprobe_exit(void)
-{
-	remove_proc_entry(procname, init_net.proc_net);
-	unregister_jprobe(&tcp_jprobe);
-	kfree(tcp_probe.log);
-}
-module_exit(tcpprobe_exit);

^ permalink raw reply related

* [PATCH net-next v6 3/6] net: sctp: Add SCTP ACK tracking trace event
From: Masami Hiramatsu @ 2017-12-28  6:11 UTC (permalink / raw)
  To: Ingo Molnar, David S . Miller, Ian McDonald, Vlad Yasevich,
	Stephen Hemminger, Steven Rostedt
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, H . Peter Anvin,
	Gerrit Renker, Neil Horman, dccp, netdev, linux-sctp,
	Stephen Rothwell, mhiramat
In-Reply-To: <151444140014.11331.17287122201510506083.stgit@devbox>

Add SCTP ACK tracking trace event to trace the changes of SCTP
association state in response to incoming packets.
It is used for debugging SCTP congestion control algorithms,
and will replace sctp_probe module.

Note that this event a bit tricky. Since this consists of 2
events (sctp_probe and sctp_probe_path) so you have to enable
both events as below.

  # cd /sys/kernel/debug/tracing
  # echo 1 > events/sctp/sctp_probe/enable
  # echo 1 > events/sctp/sctp_probe_path/enable

Or, you can enable all the events under sctp.

  # echo 1 > events/sctp/enable

Since sctp_probe_path event is always invoked from sctp_probe
event, you can not see any output if you only enable
sctp_probe_path.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
  Changes in v3:
   - Add checking whether sctp_probe_path event is enabled
     before iterating sctp paths to record. Thanks Steven.
  Changes in v4:
   - Move a temporal variable definition in the block.
   - Fix to cast pointer to unsigned long instead of __u64
     for 32bit environment.
---
 include/trace/events/sctp.h |   99 +++++++++++++++++++++++++++++++++++++++++++
 net/sctp/sm_statefuns.c     |    5 ++
 2 files changed, 104 insertions(+)
 create mode 100644 include/trace/events/sctp.h

diff --git a/include/trace/events/sctp.h b/include/trace/events/sctp.h
new file mode 100644
index 000000000000..7475c7be165a
--- /dev/null
+++ b/include/trace/events/sctp.h
@@ -0,0 +1,99 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM sctp
+
+#if !defined(_TRACE_SCTP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_SCTP_H
+
+#include <net/sctp/structs.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(sctp_probe_path,
+
+	TP_PROTO(struct sctp_transport *sp,
+		 const struct sctp_association *asoc),
+
+	TP_ARGS(sp, asoc),
+
+	TP_STRUCT__entry(
+		__field(__u64, asoc)
+		__field(__u32, primary)
+		__array(__u8, ipaddr, sizeof(union sctp_addr))
+		__field(__u32, state)
+		__field(__u32, cwnd)
+		__field(__u32, ssthresh)
+		__field(__u32, flight_size)
+		__field(__u32, partial_bytes_acked)
+		__field(__u32, pathmtu)
+	),
+
+	TP_fast_assign(
+		__entry->asoc = (unsigned long)asoc;
+		__entry->primary = (sp == asoc->peer.primary_path);
+		memcpy(__entry->ipaddr, &sp->ipaddr, sizeof(union sctp_addr));
+		__entry->state = sp->state;
+		__entry->cwnd = sp->cwnd;
+		__entry->ssthresh = sp->ssthresh;
+		__entry->flight_size = sp->flight_size;
+		__entry->partial_bytes_acked = sp->partial_bytes_acked;
+		__entry->pathmtu = sp->pathmtu;
+	),
+
+	TP_printk("asoc=%#llx%s ipaddr=%pISpc state=%u cwnd=%u ssthresh=%u "
+		  "flight_size=%u partial_bytes_acked=%u pathmtu=%u",
+		  __entry->asoc, __entry->primary ? "(*)" : "",
+		  __entry->ipaddr, __entry->state, __entry->cwnd,
+		  __entry->ssthresh, __entry->flight_size,
+		  __entry->partial_bytes_acked, __entry->pathmtu)
+);
+
+TRACE_EVENT(sctp_probe,
+
+	TP_PROTO(const struct sctp_endpoint *ep,
+		 const struct sctp_association *asoc,
+		 struct sctp_chunk *chunk),
+
+	TP_ARGS(ep, asoc, chunk),
+
+	TP_STRUCT__entry(
+		__field(__u64, asoc)
+		__field(__u32, mark)
+		__field(__u16, bind_port)
+		__field(__u16, peer_port)
+		__field(__u32, pathmtu)
+		__field(__u32, rwnd)
+		__field(__u16, unack_data)
+	),
+
+	TP_fast_assign(
+		struct sk_buff *skb = chunk->skb;
+
+		__entry->asoc = (unsigned long)asoc;
+		__entry->mark = skb->mark;
+		__entry->bind_port = ep->base.bind_addr.port;
+		__entry->peer_port = asoc->peer.port;
+		__entry->pathmtu = asoc->pathmtu;
+		__entry->rwnd = asoc->peer.rwnd;
+		__entry->unack_data = asoc->unack_data;
+
+		if (trace_sctp_probe_path_enabled()) {
+			struct sctp_transport *sp;
+
+			list_for_each_entry(sp, &asoc->peer.transport_addr_list,
+					    transports) {
+				trace_sctp_probe_path(sp, asoc);
+			}
+		}
+	),
+
+	TP_printk("asoc=%#llx mark=%#x bind_port=%d peer_port=%d pathmtu=%d "
+		  "rwnd=%u unack_data=%d",
+		  __entry->asoc, __entry->mark, __entry->bind_port,
+		  __entry->peer_port, __entry->pathmtu, __entry->rwnd,
+		  __entry->unack_data)
+);
+
+#endif /* _TRACE_SCTP_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 541f34735346..eb7905ffe5f2 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -59,6 +59,9 @@
 #include <net/sctp/sm.h>
 #include <net/sctp/structs.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/sctp.h>
+
 static struct sctp_packet *sctp_abort_pkt_new(
 					struct net *net,
 					const struct sctp_endpoint *ep,
@@ -3219,6 +3222,8 @@ enum sctp_disposition sctp_sf_eat_sack_6_2(struct net *net,
 	struct sctp_sackhdr *sackh;
 	__u32 ctsn;
 
+	trace_sctp_probe(ep, asoc, chunk);
+
 	if (!sctp_vtag_verify(chunk, asoc))
 		return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
 

^ permalink raw reply related

* [PATCH net-next v6 4/6] net: sctp: Remove debug SCTP probe module
From: Masami Hiramatsu @ 2017-12-28  6:12 UTC (permalink / raw)
  To: Ingo Molnar, David S . Miller, Ian McDonald, Vlad Yasevich,
	Stephen Hemminger, Steven Rostedt
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, H . Peter Anvin,
	Gerrit Renker, Neil Horman, dccp, netdev, linux-sctp,
	Stephen Rothwell, mhiramat
In-Reply-To: <151444140014.11331.17287122201510506083.stgit@devbox>

Remove SCTP probe module since jprobe has been deprecated.
That function is now replaced by sctp/sctp_probe and
sctp/sctp_probe_path trace-events.
You can use it via ftrace or perftools.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
 net/sctp/Kconfig  |   12 ---
 net/sctp/Makefile |    3 -
 net/sctp/probe.c  |  244 -----------------------------------------------------
 3 files changed, 259 deletions(-)
 delete mode 100644 net/sctp/probe.c

diff --git a/net/sctp/Kconfig b/net/sctp/Kconfig
index d9c04dc1b3f3..c740b189d4ba 100644
--- a/net/sctp/Kconfig
+++ b/net/sctp/Kconfig
@@ -37,18 +37,6 @@ menuconfig IP_SCTP
 
 if IP_SCTP
 
-config NET_SCTPPROBE
-	tristate "SCTP: Association probing"
-        depends on PROC_FS && KPROBES
-        ---help---
-        This module allows for capturing the changes to SCTP association
-        state in response to incoming packets. It is used for debugging
-        SCTP congestion control algorithms. If you don't understand
-        what was just said, you don't need it: say N.
-
-        To compile this code as a module, choose M here: the
-        module will be called sctp_probe.
-
 config SCTP_DBG_OBJCNT
 	bool "SCTP: Debug object counts"
 	depends on PROC_FS
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 54bd9c1a8aa1..6776582ec449 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -4,7 +4,6 @@
 #
 
 obj-$(CONFIG_IP_SCTP) += sctp.o
-obj-$(CONFIG_NET_SCTPPROBE) += sctp_probe.o
 obj-$(CONFIG_INET_SCTP_DIAG) += sctp_diag.o
 
 sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
@@ -16,8 +15,6 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
 	  offload.o stream_sched.o stream_sched_prio.o \
 	  stream_sched_rr.o stream_interleave.o
 
-sctp_probe-y := probe.o
-
 sctp-$(CONFIG_SCTP_DBG_OBJCNT) += objcnt.o
 sctp-$(CONFIG_PROC_FS) += proc.o
 sctp-$(CONFIG_SYSCTL) += sysctl.o
diff --git a/net/sctp/probe.c b/net/sctp/probe.c
deleted file mode 100644
index 1280f85a598d..000000000000
--- a/net/sctp/probe.c
+++ /dev/null
@@ -1,244 +0,0 @@
-/*
- * sctp_probe - Observe the SCTP flow with kprobes.
- *
- * The idea for this came from Werner Almesberger's umlsim
- * Copyright (C) 2004, Stephen Hemminger <shemminger@osdl.org>
- *
- * Modified for SCTP from Stephen Hemminger's code
- * Copyright (C) 2010, Wei Yongjun <yjwei@cn.fujitsu.com>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- */
-
-#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
-
-#include <linux/kernel.h>
-#include <linux/kprobes.h>
-#include <linux/socket.h>
-#include <linux/sctp.h>
-#include <linux/proc_fs.h>
-#include <linux/vmalloc.h>
-#include <linux/module.h>
-#include <linux/kfifo.h>
-#include <linux/time.h>
-#include <net/net_namespace.h>
-
-#include <net/sctp/sctp.h>
-#include <net/sctp/sm.h>
-
-MODULE_SOFTDEP("pre: sctp");
-MODULE_AUTHOR("Wei Yongjun <yjwei@cn.fujitsu.com>");
-MODULE_DESCRIPTION("SCTP snooper");
-MODULE_LICENSE("GPL");
-
-static int port __read_mostly = 0;
-MODULE_PARM_DESC(port, "Port to match (0=all)");
-module_param(port, int, 0);
-
-static unsigned int fwmark __read_mostly = 0;
-MODULE_PARM_DESC(fwmark, "skb mark to match (0=no mark)");
-module_param(fwmark, uint, 0);
-
-static int bufsize __read_mostly = 64 * 1024;
-MODULE_PARM_DESC(bufsize, "Log buffer size (default 64k)");
-module_param(bufsize, int, 0);
-
-static int full __read_mostly = 1;
-MODULE_PARM_DESC(full, "Full log (1=every ack packet received,  0=only cwnd changes)");
-module_param(full, int, 0);
-
-static const char procname[] = "sctpprobe";
-
-static struct {
-	struct kfifo	  fifo;
-	spinlock_t	  lock;
-	wait_queue_head_t wait;
-	struct timespec64 tstart;
-} sctpw;
-
-static __printf(1, 2) void printl(const char *fmt, ...)
-{
-	va_list args;
-	int len;
-	char tbuf[256];
-
-	va_start(args, fmt);
-	len = vscnprintf(tbuf, sizeof(tbuf), fmt, args);
-	va_end(args);
-
-	kfifo_in_locked(&sctpw.fifo, tbuf, len, &sctpw.lock);
-	wake_up(&sctpw.wait);
-}
-
-static int sctpprobe_open(struct inode *inode, struct file *file)
-{
-	kfifo_reset(&sctpw.fifo);
-	ktime_get_ts64(&sctpw.tstart);
-
-	return 0;
-}
-
-static ssize_t sctpprobe_read(struct file *file, char __user *buf,
-			      size_t len, loff_t *ppos)
-{
-	int error = 0, cnt = 0;
-	unsigned char *tbuf;
-
-	if (!buf)
-		return -EINVAL;
-
-	if (len == 0)
-		return 0;
-
-	tbuf = vmalloc(len);
-	if (!tbuf)
-		return -ENOMEM;
-
-	error = wait_event_interruptible(sctpw.wait,
-					 kfifo_len(&sctpw.fifo) != 0);
-	if (error)
-		goto out_free;
-
-	cnt = kfifo_out_locked(&sctpw.fifo, tbuf, len, &sctpw.lock);
-	error = copy_to_user(buf, tbuf, cnt) ? -EFAULT : 0;
-
-out_free:
-	vfree(tbuf);
-
-	return error ? error : cnt;
-}
-
-static const struct file_operations sctpprobe_fops = {
-	.owner	= THIS_MODULE,
-	.open	= sctpprobe_open,
-	.read	= sctpprobe_read,
-	.llseek = noop_llseek,
-};
-
-static enum sctp_disposition jsctp_sf_eat_sack(
-					struct net *net,
-					const struct sctp_endpoint *ep,
-					const struct sctp_association *asoc,
-					const union sctp_subtype type,
-					void *arg,
-					struct sctp_cmd_seq *commands)
-{
-	struct sctp_chunk *chunk = arg;
-	struct sk_buff *skb = chunk->skb;
-	struct sctp_transport *sp;
-	static __u32 lcwnd = 0;
-	struct timespec64 now;
-
-	sp = asoc->peer.primary_path;
-
-	if (((port == 0 && fwmark == 0) ||
-	     asoc->peer.port == port ||
-	     ep->base.bind_addr.port == port ||
-	     (fwmark > 0 && skb->mark == fwmark)) &&
-	    (full || sp->cwnd != lcwnd)) {
-		lcwnd = sp->cwnd;
-
-		ktime_get_ts64(&now);
-		now = timespec64_sub(now, sctpw.tstart);
-
-		printl("%lu.%06lu ", (unsigned long) now.tv_sec,
-		       (unsigned long) now.tv_nsec / NSEC_PER_USEC);
-
-		printl("%p %5d %5d %5d %8d %5d ", asoc,
-		       ep->base.bind_addr.port, asoc->peer.port,
-		       asoc->pathmtu, asoc->peer.rwnd, asoc->unack_data);
-
-		list_for_each_entry(sp, &asoc->peer.transport_addr_list,
-					transports) {
-			if (sp == asoc->peer.primary_path)
-				printl("*");
-
-			printl("%pISc %2u %8u %8u %8u %8u %8u ",
-			       &sp->ipaddr, sp->state, sp->cwnd, sp->ssthresh,
-			       sp->flight_size, sp->partial_bytes_acked,
-			       sp->pathmtu);
-		}
-		printl("\n");
-	}
-
-	jprobe_return();
-	return 0;
-}
-
-static struct jprobe sctp_recv_probe = {
-	.kp	= {
-		.symbol_name = "sctp_sf_eat_sack_6_2",
-	},
-	.entry	= jsctp_sf_eat_sack,
-};
-
-static __init int sctp_setup_jprobe(void)
-{
-	int ret = register_jprobe(&sctp_recv_probe);
-
-	if (ret) {
-		if (request_module("sctp"))
-			goto out;
-		ret = register_jprobe(&sctp_recv_probe);
-	}
-
-out:
-	return ret;
-}
-
-static __init int sctpprobe_init(void)
-{
-	int ret = -ENOMEM;
-
-	/* Warning: if the function signature of sctp_sf_eat_sack_6_2,
-	 * has been changed, you also have to change the signature of
-	 * jsctp_sf_eat_sack, otherwise you end up right here!
-	 */
-	BUILD_BUG_ON(__same_type(sctp_sf_eat_sack_6_2,
-				 jsctp_sf_eat_sack) == 0);
-
-	init_waitqueue_head(&sctpw.wait);
-	spin_lock_init(&sctpw.lock);
-	if (kfifo_alloc(&sctpw.fifo, bufsize, GFP_KERNEL))
-		return ret;
-
-	if (!proc_create(procname, S_IRUSR, init_net.proc_net,
-			 &sctpprobe_fops))
-		goto free_kfifo;
-
-	ret = sctp_setup_jprobe();
-	if (ret)
-		goto remove_proc;
-
-	pr_info("probe registered (port=%d/fwmark=%u) bufsize=%u\n",
-		port, fwmark, bufsize);
-	return 0;
-
-remove_proc:
-	remove_proc_entry(procname, init_net.proc_net);
-free_kfifo:
-	kfifo_free(&sctpw.fifo);
-	return ret;
-}
-
-static __exit void sctpprobe_exit(void)
-{
-	kfifo_free(&sctpw.fifo);
-	remove_proc_entry(procname, init_net.proc_net);
-	unregister_jprobe(&sctp_recv_probe);
-}
-
-module_init(sctpprobe_init);
-module_exit(sctpprobe_exit);

^ permalink raw reply related

* [PATCH net-next v6 5/6] net: dccp: Add DCCP sendmsg trace event
From: Masami Hiramatsu @ 2017-12-28  6:12 UTC (permalink / raw)
  To: Ingo Molnar, David S . Miller, Ian McDonald, Vlad Yasevich,
	Stephen Hemminger, Steven Rostedt
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, H . Peter Anvin,
	Gerrit Renker, Neil Horman, dccp, netdev, linux-sctp,
	Stephen Rothwell, mhiramat
In-Reply-To: <151444140014.11331.17287122201510506083.stgit@devbox>

Add DCCP sendmsg trace event (dccp/dccp_probe) for
replacing dccpprobe. User can trace this event via
ftrace or perftools.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
  Changes in v5
   - Fix to add local directory to include for trace.h.
     Thanks Steven!
---
 net/dccp/Makefile |    3 ++
 net/dccp/proto.c  |    5 +++
 net/dccp/trace.h  |  105 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 113 insertions(+)
 create mode 100644 net/dccp/trace.h

diff --git a/net/dccp/Makefile b/net/dccp/Makefile
index 2e7b56097bc4..4215f13a63af 100644
--- a/net/dccp/Makefile
+++ b/net/dccp/Makefile
@@ -27,3 +27,6 @@ dccp-$(CONFIG_SYSCTL) += sysctl.o
 
 dccp_diag-y := diag.o
 dccp_probe-y := probe.o
+
+# build with local directory for trace.h
+CFLAGS_proto.o := -I$(src)
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 7a75a1d3568b..fa7e92e08920 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -38,6 +38,9 @@
 #include "dccp.h"
 #include "feat.h"
 
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 DEFINE_SNMP_STAT(struct dccp_mib, dccp_statistics) __read_mostly;
 
 EXPORT_SYMBOL_GPL(dccp_statistics);
@@ -761,6 +764,8 @@ int dccp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	int rc, size;
 	long timeo;
 
+	trace_dccp_probe(sk, len);
+
 	if (len > dp->dccps_mss_cache)
 		return -EMSGSIZE;
 
diff --git a/net/dccp/trace.h b/net/dccp/trace.h
new file mode 100644
index 000000000000..aa01321a6c37
--- /dev/null
+++ b/net/dccp/trace.h
@@ -0,0 +1,105 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dccp
+
+#if !defined(_TRACE_DCCP_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DCCP_H
+
+#include <net/sock.h>
+#include "dccp.h"
+#include "ccids/ccid3.h"
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(dccp_probe,
+
+	TP_PROTO(struct sock *sk, size_t size),
+
+	TP_ARGS(sk, size),
+
+	TP_STRUCT__entry(
+		/* sockaddr_in6 is always bigger than sockaddr_in */
+		__array(__u8, saddr, sizeof(struct sockaddr_in6))
+		__array(__u8, daddr, sizeof(struct sockaddr_in6))
+		__field(__u16, sport)
+		__field(__u16, dport)
+		__field(__u16, size)
+		__field(__u16, tx_s)
+		__field(__u32, tx_rtt)
+		__field(__u32, tx_p)
+		__field(__u32, tx_x_calc)
+		__field(__u64, tx_x_recv)
+		__field(__u64, tx_x)
+		__field(__u32, tx_t_ipi)
+	),
+
+	TP_fast_assign(
+		const struct inet_sock *inet = inet_sk(sk);
+		struct ccid3_hc_tx_sock *hc = NULL;
+
+		if (ccid_get_current_tx_ccid(dccp_sk(sk)) == DCCPC_CCID3)
+			hc = ccid3_hc_tx_sk(sk);
+
+		memset(__entry->saddr, 0, sizeof(struct sockaddr_in6));
+		memset(__entry->daddr, 0, sizeof(struct sockaddr_in6));
+
+		if (sk->sk_family == AF_INET) {
+			struct sockaddr_in *v4 = (void *)__entry->saddr;
+
+			v4->sin_family = AF_INET;
+			v4->sin_port = inet->inet_sport;
+			v4->sin_addr.s_addr = inet->inet_saddr;
+			v4 = (void *)__entry->daddr;
+			v4->sin_family = AF_INET;
+			v4->sin_port = inet->inet_dport;
+			v4->sin_addr.s_addr = inet->inet_daddr;
+#if IS_ENABLED(CONFIG_IPV6)
+		} else if (sk->sk_family == AF_INET6) {
+			struct sockaddr_in6 *v6 = (void *)__entry->saddr;
+
+			v6->sin6_family = AF_INET6;
+			v6->sin6_port = inet->inet_sport;
+			v6->sin6_addr = inet6_sk(sk)->saddr;
+			v6 = (void *)__entry->daddr;
+			v6->sin6_family = AF_INET6;
+			v6->sin6_port = inet->inet_dport;
+			v6->sin6_addr = sk->sk_v6_daddr;
+#endif
+		}
+
+		/* For filtering use */
+		__entry->sport = ntohs(inet->inet_sport);
+		__entry->dport = ntohs(inet->inet_dport);
+
+		__entry->size = size;
+		if (hc) {
+			__entry->tx_s = hc->tx_s;
+			__entry->tx_rtt = hc->tx_rtt;
+			__entry->tx_p = hc->tx_p;
+			__entry->tx_x_calc = hc->tx_x_calc;
+			__entry->tx_x_recv = hc->tx_x_recv >> 6;
+			__entry->tx_x = hc->tx_x >> 6;
+			__entry->tx_t_ipi = hc->tx_t_ipi;
+		} else {
+			__entry->tx_s = 0;
+			memset(&__entry->tx_rtt, 0, (void *)&__entry->tx_t_ipi -
+			       (void *)&__entry->tx_rtt +
+			       sizeof(__entry->tx_t_ipi));
+		}
+	),
+
+	TP_printk("src=%pISpc dest=%pISpc size=%d tx_s=%d tx_rtt=%d "
+		  "tx_p=%d tx_x_calc=%u tx_x_recv=%llu tx_x=%llu tx_t_ipi=%d",
+		  __entry->saddr, __entry->daddr, __entry->size,
+		  __entry->tx_s, __entry->tx_rtt, __entry->tx_p,
+		  __entry->tx_x_calc, __entry->tx_x_recv, __entry->tx_x,
+		  __entry->tx_t_ipi)
+);
+
+#endif /* _TRACE_TCP_H */
+
+/* This part must be outside protection */
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+#include <trace/define_trace.h>

^ permalink raw reply related

* [PATCH net-next v6 6/6] net: dccp: Remove dccpprobe module
From: Masami Hiramatsu @ 2017-12-28  6:13 UTC (permalink / raw)
  To: Ingo Molnar, David S . Miller, Ian McDonald, Vlad Yasevich,
	Stephen Hemminger, Steven Rostedt
  Cc: Peter Zijlstra, Thomas Gleixner, LKML, H . Peter Anvin,
	Gerrit Renker, Neil Horman, dccp, netdev, linux-sctp,
	Stephen Rothwell, mhiramat
In-Reply-To: <151444140014.11331.17287122201510506083.stgit@devbox>

Remove DCCP probe module since jprobe has been deprecated.
That function is now replaced by dccp/dccp_probe trace-event.
You can use it via ftrace or perftools.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
---
 Changes in v5:
  - Fix a conflict with previous change in Makefile.
---
 net/dccp/Kconfig  |   17 ----
 net/dccp/Makefile |    2 -
 net/dccp/probe.c  |  203 -----------------------------------------------------
 3 files changed, 222 deletions(-)
 delete mode 100644 net/dccp/probe.c

diff --git a/net/dccp/Kconfig b/net/dccp/Kconfig
index 8c0ef71bed2f..b270e84d9c13 100644
--- a/net/dccp/Kconfig
+++ b/net/dccp/Kconfig
@@ -39,23 +39,6 @@ config IP_DCCP_DEBUG
 
 	  Just say N.
 
-config NET_DCCPPROBE
-	tristate "DCCP connection probing"
-	depends on PROC_FS && KPROBES
-	---help---
-	This module allows for capturing the changes to DCCP connection
-	state in response to incoming packets. It is used for debugging
-	DCCP congestion avoidance modules. If you don't understand
-	what was just said, you don't need it: say N.
-
-	Documentation on how to use DCCP connection probing can be found
-	at:
-	
-	  http://www.linuxfoundation.org/collaborate/workgroups/networking/dccpprobe
-
-	To compile this code as a module, choose M here: the
-	module will be called dccp_probe.
-
 
 endmenu
 
diff --git a/net/dccp/Makefile b/net/dccp/Makefile
index 4215f13a63af..5b4ff37bc806 100644
--- a/net/dccp/Makefile
+++ b/net/dccp/Makefile
@@ -21,12 +21,10 @@ obj-$(subst y,$(CONFIG_IP_DCCP),$(CONFIG_IPV6)) += dccp_ipv6.o
 dccp_ipv6-y := ipv6.o
 
 obj-$(CONFIG_INET_DCCP_DIAG) += dccp_diag.o
-obj-$(CONFIG_NET_DCCPPROBE) += dccp_probe.o
 
 dccp-$(CONFIG_SYSCTL) += sysctl.o
 
 dccp_diag-y := diag.o
-dccp_probe-y := probe.o
 
 # build with local directory for trace.h
 CFLAGS_proto.o := -I$(src)
diff --git a/net/dccp/probe.c b/net/dccp/probe.c
deleted file mode 100644
index 3d3fda05b32d..000000000000
--- a/net/dccp/probe.c
+++ /dev/null
@@ -1,203 +0,0 @@
-/*
- * dccp_probe - Observe the DCCP flow with kprobes.
- *
- * The idea for this came from Werner Almesberger's umlsim
- * Copyright (C) 2004, Stephen Hemminger <shemminger@osdl.org>
- *
- * Modified for DCCP from Stephen Hemminger's code
- * Copyright (C) 2006, Ian McDonald <ian.mcdonald@jandi.co.nz>
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- */
-
-#include <linux/kernel.h>
-#include <linux/kprobes.h>
-#include <linux/socket.h>
-#include <linux/dccp.h>
-#include <linux/proc_fs.h>
-#include <linux/module.h>
-#include <linux/kfifo.h>
-#include <linux/vmalloc.h>
-#include <linux/time64.h>
-#include <linux/gfp.h>
-#include <net/net_namespace.h>
-
-#include "dccp.h"
-#include "ccid.h"
-#include "ccids/ccid3.h"
-
-static int port;
-
-static int bufsize = 64 * 1024;
-
-static const char procname[] = "dccpprobe";
-
-static struct {
-	struct kfifo	  fifo;
-	spinlock_t	  lock;
-	wait_queue_head_t wait;
-	struct timespec64 tstart;
-} dccpw;
-
-static void printl(const char *fmt, ...)
-{
-	va_list args;
-	int len;
-	struct timespec64 now;
-	char tbuf[256];
-
-	va_start(args, fmt);
-	getnstimeofday64(&now);
-
-	now = timespec64_sub(now, dccpw.tstart);
-
-	len = sprintf(tbuf, "%lu.%06lu ",
-		      (unsigned long) now.tv_sec,
-		      (unsigned long) now.tv_nsec / NSEC_PER_USEC);
-	len += vscnprintf(tbuf+len, sizeof(tbuf)-len, fmt, args);
-	va_end(args);
-
-	kfifo_in_locked(&dccpw.fifo, tbuf, len, &dccpw.lock);
-	wake_up(&dccpw.wait);
-}
-
-static int jdccp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
-{
-	const struct inet_sock *inet = inet_sk(sk);
-	struct ccid3_hc_tx_sock *hc = NULL;
-
-	if (ccid_get_current_tx_ccid(dccp_sk(sk)) == DCCPC_CCID3)
-		hc = ccid3_hc_tx_sk(sk);
-
-	if (port == 0 || ntohs(inet->inet_dport) == port ||
-	    ntohs(inet->inet_sport) == port) {
-		if (hc)
-			printl("%pI4:%u %pI4:%u %d %d %d %d %u %llu %llu %d\n",
-			       &inet->inet_saddr, ntohs(inet->inet_sport),
-			       &inet->inet_daddr, ntohs(inet->inet_dport), size,
-			       hc->tx_s, hc->tx_rtt, hc->tx_p,
-			       hc->tx_x_calc, hc->tx_x_recv >> 6,
-			       hc->tx_x >> 6, hc->tx_t_ipi);
-		else
-			printl("%pI4:%u %pI4:%u %d\n",
-			       &inet->inet_saddr, ntohs(inet->inet_sport),
-			       &inet->inet_daddr, ntohs(inet->inet_dport),
-			       size);
-	}
-
-	jprobe_return();
-	return 0;
-}
-
-static struct jprobe dccp_send_probe = {
-	.kp	= {
-		.symbol_name = "dccp_sendmsg",
-	},
-	.entry	= jdccp_sendmsg,
-};
-
-static int dccpprobe_open(struct inode *inode, struct file *file)
-{
-	kfifo_reset(&dccpw.fifo);
-	getnstimeofday64(&dccpw.tstart);
-	return 0;
-}
-
-static ssize_t dccpprobe_read(struct file *file, char __user *buf,
-			      size_t len, loff_t *ppos)
-{
-	int error = 0, cnt = 0;
-	unsigned char *tbuf;
-
-	if (!buf)
-		return -EINVAL;
-
-	if (len == 0)
-		return 0;
-
-	tbuf = vmalloc(len);
-	if (!tbuf)
-		return -ENOMEM;
-
-	error = wait_event_interruptible(dccpw.wait,
-					 kfifo_len(&dccpw.fifo) != 0);
-	if (error)
-		goto out_free;
-
-	cnt = kfifo_out_locked(&dccpw.fifo, tbuf, len, &dccpw.lock);
-	error = copy_to_user(buf, tbuf, cnt) ? -EFAULT : 0;
-
-out_free:
-	vfree(tbuf);
-
-	return error ? error : cnt;
-}
-
-static const struct file_operations dccpprobe_fops = {
-	.owner	 = THIS_MODULE,
-	.open	 = dccpprobe_open,
-	.read    = dccpprobe_read,
-	.llseek  = noop_llseek,
-};
-
-static __init int dccpprobe_init(void)
-{
-	int ret = -ENOMEM;
-
-	init_waitqueue_head(&dccpw.wait);
-	spin_lock_init(&dccpw.lock);
-	if (kfifo_alloc(&dccpw.fifo, bufsize, GFP_KERNEL))
-		return ret;
-	if (!proc_create(procname, S_IRUSR, init_net.proc_net, &dccpprobe_fops))
-		goto err0;
-
-	ret = register_jprobe(&dccp_send_probe);
-	if (ret) {
-		ret = request_module("dccp");
-		if (!ret)
-			ret = register_jprobe(&dccp_send_probe);
-	}
-
-	if (ret)
-		goto err1;
-
-	pr_info("DCCP watch registered (port=%d)\n", port);
-	return 0;
-err1:
-	remove_proc_entry(procname, init_net.proc_net);
-err0:
-	kfifo_free(&dccpw.fifo);
-	return ret;
-}
-module_init(dccpprobe_init);
-
-static __exit void dccpprobe_exit(void)
-{
-	kfifo_free(&dccpw.fifo);
-	remove_proc_entry(procname, init_net.proc_net);
-	unregister_jprobe(&dccp_send_probe);
-
-}
-module_exit(dccpprobe_exit);
-
-MODULE_PARM_DESC(port, "Port to match (0=all)");
-module_param(port, int, 0);
-
-MODULE_PARM_DESC(bufsize, "Log buffer size (default 64k)");
-module_param(bufsize, int, 0);
-
-MODULE_AUTHOR("Ian McDonald <ian.mcdonald@jandi.co.nz>");
-MODULE_DESCRIPTION("DCCP snooper");
-MODULE_LICENSE("GPL");

^ permalink raw reply related

* [PATCH net-next] cxgb4/cxgb4vf: support for XLAUI Port Type
From: Ganesh Goudar @ 2017-12-28  6:37 UTC (permalink / raw)
  To: netdev, davem; +Cc: nirranjan, indranil, venkatesh, Ganesh Goudar, Casey Leedom

Add support for new Backplane XLAUI port type.

Signed-off-by: Casey Leedom <leedom@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c  | 10 +++++++++-
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c          |  1 +
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h       |  1 +
 drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 10 +++++++++-
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
index 541419b..7852d98 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_ethtool.c
@@ -517,7 +517,8 @@ static int from_fw_port_mod_type(enum fw_port_type port_type,
 		else
 			return PORT_OTHER;
 	} else if (port_type == FW_PORT_TYPE_KR4_100G ||
-		   port_type == FW_PORT_TYPE_KR_SFP28) {
+		   port_type == FW_PORT_TYPE_KR_SFP28 ||
+		   port_type == FW_PORT_TYPE_KR_XLAUI) {
 		return PORT_NONE;
 	}
 
@@ -645,6 +646,13 @@ static void fw_caps_to_lmm(enum fw_port_type port_type,
 		FW_CAPS_TO_LMM(SPEED_25G, 25000baseKR_Full);
 		break;
 
+	case FW_PORT_TYPE_KR_XLAUI:
+		SET_LMM(Backplane);
+		FW_CAPS_TO_LMM(SPEED_1G, 1000baseKX_Full);
+		FW_CAPS_TO_LMM(SPEED_10G, 10000baseKR_Full);
+		FW_CAPS_TO_LMM(SPEED_40G, 40000baseKR4_Full);
+		break;
+
 	case FW_PORT_TYPE_CR2_QSFP:
 		SET_LMM(FIBRE);
 		SET_LMM(50000baseSR2_Full);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 44930ca..f2a60e0 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -6084,6 +6084,7 @@ const char *t4_get_port_type_description(enum fw_port_type port_type)
 		"CR2_QSFP",
 		"SFP28",
 		"KR_SFP28",
+		"KR_XLAUI"
 	};
 
 	if (port_type < ARRAY_SIZE(port_type_description))
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
index 01f5a5e..427f252 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
@@ -2829,6 +2829,7 @@ enum fw_port_type {
 	FW_PORT_TYPE_CR2_QSFP,
 	FW_PORT_TYPE_SFP28,
 	FW_PORT_TYPE_KR_SFP28,
+	FW_PORT_TYPE_KR_XLAUI,
 
 	FW_PORT_TYPE_NONE = FW_PORT_CMD_PTYPE_M
 };
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index b48361c..96f69f8 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -1229,7 +1229,8 @@ static int from_fw_port_mod_type(enum fw_port_type port_type,
 		else
 			return PORT_OTHER;
 	} else if (port_type == FW_PORT_TYPE_KR4_100G ||
-		   port_type == FW_PORT_TYPE_KR_SFP28) {
+		   port_type == FW_PORT_TYPE_KR_SFP28 ||
+		   port_type == FW_PORT_TYPE_KR_XLAUI) {
 		return PORT_NONE;
 	}
 
@@ -1323,6 +1324,13 @@ static void fw_caps_to_lmm(enum fw_port_type port_type,
 		SET_LMM(25000baseKR_Full);
 		break;
 
+	case FW_PORT_TYPE_KR_XLAUI:
+		SET_LMM(Backplane);
+		FW_CAPS_TO_LMM(SPEED_1G, 1000baseKX_Full);
+		FW_CAPS_TO_LMM(SPEED_10G, 10000baseKR_Full);
+		FW_CAPS_TO_LMM(SPEED_40G, 40000baseKR4_Full);
+		break;
+
 	case FW_PORT_TYPE_CR2_QSFP:
 		SET_LMM(FIBRE);
 		SET_LMM(50000baseSR2_Full);
-- 
2.1.0

^ permalink raw reply related

* [PATCH net-next v9 0/2] add UniPhier AVE ethernet support
From: Kunihiko Hayashi @ 2017-12-28  6:58 UTC (permalink / raw)
  To: David Miller, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: Andrew Lunn, Florian Fainelli, Rob Herring, Mark Rutland,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Masahiro Yamada,
	Masami Hiramatsu, Jassi Brar, Kunihiko Hayashi

This series adds support for Socionext AVE ethernet controller implemented
on UniPhier SoCs. This driver supports RGMII/RMII modes.

v8: https://www.spinics.net/lists/netdev/msg474374.html

The PHY patch included in v1 has already separated in:
http://www.spinics.net/lists/netdev/msg454595.html

Changes since v8:
- move operators at the beginning of the line to the end of the previous line
- dt-bindings: add blank lines before mdio and phy subnodes

Changes since v7:
- dt-bindings: fix mdio subnode description

Changes since v6:
- sort the order of local variables from longest to shortest line
- fix ave_probe() which calls register_netdev() at the end of initialization
- dt-bindings: remove phy node descriptions in mdio node

Changes since v5:
- replace license boilerplate with SPDX Identifier
- remove inline directives and an unused function

Changes since v4:
- fix larger integer warning on AVE_PFMBYTE_MASK0

Changes since v3:
- remove checking dma address and use dma_set_mask() to restirct address
- replace ave_mdio_busywait() with read_poll_timeout()
- replace functions to access to registers with readl/writel() directly
- replace a function to access to macaddr with ave_hw_write_macaddr()
- change return value of ave_dma_map() to error value
- move mdiobus_unregister() from ave_remove() to ave_uninit()
- eliminate else block at the end of ave_dma_map()
- add mask definitions for packet filter
- sort bitmap definitions in descending order
- add error check to some functions
- rename and sort functions to clear sub-categories
- fix error value consistency
- remove unneeded initializers
- change type of constant arrays

Changes since v2:
- replace clk_get() with devm_clk_get()
- replace reset_control_get() with devm_reset_control_get_optional_shared()
- add error return when the error occurs on the above *_get functions
- sort soc data and compatible strings
- remove clearly obvious comments
- modify dt-bindings document consistent with these modifications

Changes since v1:
- add/remove devicetree properties and sub-node
  - remove "internal-phy-interrupt" and "desc-bits" property
  - add SoC data structures based on compatible strings
  - add node operation to apply "mdio" sub-node
- add support for features
  - add support for {get,set}_pauseparam and pause frame operations
  - add support for ndo_get_stats64 instead of ndo_get_stats
- replace with desiable functions
  - replace check for valid phy_mode with phy_interface{_mode}_is_rgmii()
  - replace phy attach message with phy_attached_info()
  - replace 32bit operation with {upper,lower}_32_bits() on ave_wdesc_addr()
  - replace nway_reset and get_link with generic functions
- move operations to proper functions
  - move phy_start_aneg() to ndo_open,
    and remove unnecessary PHY interrupt operations
    See http://www.spinics.net/lists/netdev/msg454590.html
  - move irq initialization and descriptor memory allocation to ndo_open
  - move initialization of reset and clock and mdiobus to ndo_init
- fix skbuffer operations
  - fix skb alignment operations and add Rx buffer adjustment for descriptor
    See http://www.spinics.net/lists/netdev/msg456014.html
  - add error returns when dma_map_single() failed 
- clean up code structures
  - clean up wait-loop and wake-queue conditions
  - add ave_wdesc_addr() and offset definitions
  - add ave_macaddr_init() to clean up mac-address operation
  - fix checking whether Tx entry is not enough
  - fix supported features of phydev
  - add necessary free/disable operations
  - add phydev check on ave_{get,set}_wol()
  - remove netif_carrier functions, phydev initializer, and Tx budget check
- change obsolate codes
  - replace ndev->{base_addr,irq} with the members of ave_private
- rename goto labels and mask definitions, and remove unused codes

Kunihiko Hayashi (2):
  dt-bindings: net: add DT bindings for Socionext UniPhier AVE
  net: ethernet: socionext: add AVE ethernet driver

 .../bindings/net/socionext,uniphier-ave4.txt       |   48 +
 drivers/net/ethernet/Kconfig                       |    1 +
 drivers/net/ethernet/Makefile                      |    1 +
 drivers/net/ethernet/socionext/Kconfig             |   22 +
 drivers/net/ethernet/socionext/Makefile            |    5 +
 drivers/net/ethernet/socionext/sni_ave.c           | 1736 ++++++++++++++++++++
 6 files changed, 1813 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
 create mode 100644 drivers/net/ethernet/socionext/Kconfig
 create mode 100644 drivers/net/ethernet/socionext/Makefile
 create mode 100644 drivers/net/ethernet/socionext/sni_ave.c

-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net-next v9 1/2] dt-bindings: net: add DT bindings for Socionext UniPhier AVE
From: Kunihiko Hayashi @ 2017-12-28  6:58 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: Mark Rutland, Andrew Lunn, Florian Fainelli, Kunihiko Hayashi,
	Masami Hiramatsu, devicetree, linux-kernel, Masahiro Yamada,
	Rob Herring, Jassi Brar, linux-arm-kernel
In-Reply-To: <1514444292-20643-1-git-send-email-hayashi.kunihiko@socionext.com>

DT bindings for the AVE ethernet controller found on Socionext's
UniPhier platforms.

Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
---
 .../bindings/net/socionext,uniphier-ave4.txt       | 48 ++++++++++++++++++++++
 1 file changed, 48 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt

diff --git a/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
new file mode 100644
index 0000000..270ea4e
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/socionext,uniphier-ave4.txt
@@ -0,0 +1,48 @@
+* Socionext AVE ethernet controller
+
+This describes the devicetree bindings for AVE ethernet controller
+implemented on Socionext UniPhier SoCs.
+
+Required properties:
+ - compatible: Should be
+	- "socionext,uniphier-pro4-ave4" : for Pro4 SoC
+	- "socionext,uniphier-pxs2-ave4" : for PXs2 SoC
+	- "socionext,uniphier-ld11-ave4" : for LD11 SoC
+	- "socionext,uniphier-ld20-ave4" : for LD20 SoC
+ - reg: Address where registers are mapped and size of region.
+ - interrupts: Should contain the MAC interrupt.
+ - phy-mode: See ethernet.txt in the same directory. Allow to choose
+	"rgmii", "rmii", or "mii" according to the PHY.
+ - phy-handle: Should point to the external phy device.
+	See ethernet.txt file in the same directory.
+ - clocks: A phandle to the clock for the MAC.
+
+Optional properties:
+ - resets: A phandle to the reset control for the MAC.
+ - local-mac-address: See ethernet.txt in the same directory.
+
+Required subnode:
+ - mdio: A container for child nodes representing phy nodes.
+         See phy.txt in the same directory.
+
+Example:
+
+	ether: ethernet@65000000 {
+		compatible = "socionext,uniphier-ld20-ave4";
+		reg = <0x65000000 0x8500>;
+		interrupts = <0 66 4>;
+		phy-mode = "rgmii";
+		phy-handle = <&ethphy>;
+		clocks = <&sys_clk 6>;
+		resets = <&sys_rst 6>;
+		local-mac-address = [00 00 00 00 00 00];
+
+		mdio {
+			#address-cells = <1>;
+			#size-cells = <0>;
+
+			ethphy: ethphy@1 {
+				reg = <1>;
+			};
+		};
+	};
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v9 2/2] net: ethernet: socionext: add AVE ethernet driver
From: Kunihiko Hayashi @ 2017-12-28  6:58 UTC (permalink / raw)
  To: David Miller, netdev
  Cc: Andrew Lunn, Florian Fainelli, Rob Herring, Mark Rutland,
	linux-arm-kernel, linux-kernel, devicetree, Masahiro Yamada,
	Masami Hiramatsu, Jassi Brar, Kunihiko Hayashi
In-Reply-To: <1514444292-20643-1-git-send-email-hayashi.kunihiko@socionext.com>

The UniPhier platform from Socionext provides the AVE ethernet
controller that includes MAC and MDIO bus supporting RGMII/RMII
modes. The controller is named AVE.

Signed-off-by: Kunihiko Hayashi <hayashi.kunihiko@socionext.com>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
---
 drivers/net/ethernet/Kconfig             |    1 +
 drivers/net/ethernet/Makefile            |    1 +
 drivers/net/ethernet/socionext/Kconfig   |   22 +
 drivers/net/ethernet/socionext/Makefile  |    5 +
 drivers/net/ethernet/socionext/sni_ave.c | 1736 ++++++++++++++++++++++++++++++
 5 files changed, 1765 insertions(+)
 create mode 100644 drivers/net/ethernet/socionext/Kconfig
 create mode 100644 drivers/net/ethernet/socionext/Makefile
 create mode 100644 drivers/net/ethernet/socionext/sni_ave.c

diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index c604213..d50519e 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -170,6 +170,7 @@ source "drivers/net/ethernet/sis/Kconfig"
 source "drivers/net/ethernet/sfc/Kconfig"
 source "drivers/net/ethernet/sgi/Kconfig"
 source "drivers/net/ethernet/smsc/Kconfig"
+source "drivers/net/ethernet/socionext/Kconfig"
 source "drivers/net/ethernet/stmicro/Kconfig"
 source "drivers/net/ethernet/sun/Kconfig"
 source "drivers/net/ethernet/tehuti/Kconfig"
diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
index 39f62733..6cf5ade 100644
--- a/drivers/net/ethernet/Makefile
+++ b/drivers/net/ethernet/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_SFC) += sfc/
 obj-$(CONFIG_SFC_FALCON) += sfc/falcon/
 obj-$(CONFIG_NET_VENDOR_SGI) += sgi/
 obj-$(CONFIG_NET_VENDOR_SMSC) += smsc/
+obj-$(CONFIG_NET_VENDOR_SOCIONEXT) += socionext/
 obj-$(CONFIG_NET_VENDOR_STMICRO) += stmicro/
 obj-$(CONFIG_NET_VENDOR_SUN) += sun/
 obj-$(CONFIG_NET_VENDOR_TEHUTI) += tehuti/
diff --git a/drivers/net/ethernet/socionext/Kconfig b/drivers/net/ethernet/socionext/Kconfig
new file mode 100644
index 0000000..3a1829e
--- /dev/null
+++ b/drivers/net/ethernet/socionext/Kconfig
@@ -0,0 +1,22 @@
+config NET_VENDOR_SOCIONEXT
+	bool "Socionext ethernet drivers"
+	default y
+	---help---
+	  Option to select ethernet drivers for Socionext platforms.
+
+	  Note that the answer to this question doesn't directly affect the
+	  kernel: saying N will just cause the configurator to skip all
+	  the questions about Socionext devices. If you say Y, you will be asked
+	  for your specific card in the following questions.
+
+if NET_VENDOR_SOCIONEXT
+
+config SNI_AVE
+	tristate "Socionext AVE ethernet support"
+	depends on (ARCH_UNIPHIER || COMPILE_TEST) && OF
+	select PHYLIB
+	---help---
+	  Driver for gigabit ethernet MACs, called AVE, in the
+	  Socionext UniPhier family.
+
+endif #NET_VENDOR_SOCIONEXT
diff --git a/drivers/net/ethernet/socionext/Makefile b/drivers/net/ethernet/socionext/Makefile
new file mode 100644
index 0000000..ab83df6
--- /dev/null
+++ b/drivers/net/ethernet/socionext/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for all ethernet ip drivers on Socionext platforms
+#
+obj-$(CONFIG_SNI_AVE) += sni_ave.o
diff --git a/drivers/net/ethernet/socionext/sni_ave.c b/drivers/net/ethernet/socionext/sni_ave.c
new file mode 100644
index 0000000..111e7ca
--- /dev/null
+++ b/drivers/net/ethernet/socionext/sni_ave.c
@@ -0,0 +1,1736 @@
+// SPDX-License-Identifier: GPL-2.0
+/**
+ * sni_ave.c - Socionext UniPhier AVE ethernet driver
+ * Copyright 2014 Panasonic Corporation
+ * Copyright 2015-2017 Socionext Inc.
+ */
+
+#include <linux/bitops.h>
+#include <linux/clk.h>
+#include <linux/etherdevice.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/iopoll.h>
+#include <linux/mii.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/of_net.h>
+#include <linux/of_mdio.h>
+#include <linux/of_platform.h>
+#include <linux/phy.h>
+#include <linux/reset.h>
+#include <linux/types.h>
+#include <linux/u64_stats_sync.h>
+
+/* General Register Group */
+#define AVE_IDR			0x000	/* ID */
+#define AVE_VR			0x004	/* Version */
+#define AVE_GRR			0x008	/* Global Reset */
+#define AVE_CFGR		0x00c	/* Configuration */
+
+/* Interrupt Register Group */
+#define AVE_GIMR		0x100	/* Global Interrupt Mask */
+#define AVE_GISR		0x104	/* Global Interrupt Status */
+
+/* MAC Register Group */
+#define AVE_TXCR		0x200	/* TX Setup */
+#define AVE_RXCR		0x204	/* RX Setup */
+#define AVE_RXMAC1R		0x208	/* MAC address (lower) */
+#define AVE_RXMAC2R		0x20c	/* MAC address (upper) */
+#define AVE_MDIOCTR		0x214	/* MDIO Control */
+#define AVE_MDIOAR		0x218	/* MDIO Address */
+#define AVE_MDIOWDR		0x21c	/* MDIO Data */
+#define AVE_MDIOSR		0x220	/* MDIO Status */
+#define AVE_MDIORDR		0x224	/* MDIO Rd Data */
+
+/* Descriptor Control Register Group */
+#define AVE_DESCC		0x300	/* Descriptor Control */
+#define AVE_TXDC		0x304	/* TX Descriptor Configuration */
+#define AVE_RXDC0		0x308	/* RX Descriptor Ring0 Configuration */
+#define AVE_IIRQC		0x34c	/* Interval IRQ Control */
+
+/* Packet Filter Register Group */
+#define AVE_PKTF_BASE		0x800	/* PF Base Address */
+#define AVE_PFMBYTE_BASE	0xd00	/* PF Mask Byte Base Address */
+#define AVE_PFMBIT_BASE		0xe00	/* PF Mask Bit Base Address */
+#define AVE_PFSEL_BASE		0xf00	/* PF Selector Base Address */
+#define AVE_PFEN		0xffc	/* Packet Filter Enable */
+#define AVE_PKTF(ent)		(AVE_PKTF_BASE + (ent) * 0x40)
+#define AVE_PFMBYTE(ent)	(AVE_PFMBYTE_BASE + (ent) * 8)
+#define AVE_PFMBIT(ent)		(AVE_PFMBIT_BASE + (ent) * 4)
+#define AVE_PFSEL(ent)		(AVE_PFSEL_BASE + (ent) * 4)
+
+/* 64bit descriptor memory */
+#define AVE_DESC_SIZE_64	12	/* Descriptor Size */
+
+#define AVE_TXDM_64		0x1000	/* Tx Descriptor Memory */
+#define AVE_RXDM_64		0x1c00	/* Rx Descriptor Memory */
+
+#define AVE_TXDM_SIZE_64	0x0ba0	/* Tx Descriptor Memory Size 3KB */
+#define AVE_RXDM_SIZE_64	0x6000	/* Rx Descriptor Memory Size 24KB */
+
+/* 32bit descriptor memory */
+#define AVE_DESC_SIZE_32	8	/* Descriptor Size */
+
+#define AVE_TXDM_32		0x1000	/* Tx Descriptor Memory */
+#define AVE_RXDM_32		0x1800	/* Rx Descriptor Memory */
+
+#define AVE_TXDM_SIZE_32	0x07c0	/* Tx Descriptor Memory Size 2KB */
+#define AVE_RXDM_SIZE_32	0x4000	/* Rx Descriptor Memory Size 16KB */
+
+/* RMII Bridge Register Group */
+#define AVE_RSTCTRL		0x8028	/* Reset control */
+#define AVE_RSTCTRL_RMIIRST	BIT(16)
+#define AVE_LINKSEL		0x8034	/* Link speed setting */
+#define AVE_LINKSEL_100M	BIT(0)
+
+/* AVE_GRR */
+#define AVE_GRR_RXFFR		BIT(5)	/* Reset RxFIFO */
+#define AVE_GRR_PHYRST		BIT(4)	/* Reset external PHY */
+#define AVE_GRR_GRST		BIT(0)	/* Reset all MAC */
+
+/* AVE_CFGR */
+#define AVE_CFGR_FLE		BIT(31)	/* Filter Function */
+#define AVE_CFGR_CHE		BIT(30)	/* Checksum Function */
+#define AVE_CFGR_MII		BIT(27)	/* Func mode (1:MII/RMII, 0:RGMII) */
+#define AVE_CFGR_IPFCEN		BIT(24)	/* IP fragment sum Enable */
+
+/* AVE_GISR (common with GIMR) */
+#define AVE_GI_PHY		BIT(24)	/* PHY interrupt */
+#define AVE_GI_TX		BIT(16)	/* Tx complete */
+#define AVE_GI_RXERR		BIT(8)	/* Receive frame more than max size */
+#define AVE_GI_RXOVF		BIT(7)	/* Overflow at the RxFIFO */
+#define AVE_GI_RXDROP		BIT(6)	/* Drop packet */
+#define AVE_GI_RXIINT		BIT(5)	/* Interval interrupt */
+
+/* AVE_TXCR */
+#define AVE_TXCR_FLOCTR		BIT(18)	/* Flow control */
+#define AVE_TXCR_TXSPD_1G	BIT(17)
+#define AVE_TXCR_TXSPD_100	BIT(16)
+
+/* AVE_RXCR */
+#define AVE_RXCR_RXEN		BIT(30)	/* Rx enable */
+#define AVE_RXCR_FDUPEN		BIT(22)	/* Interface mode */
+#define AVE_RXCR_FLOCTR		BIT(21)	/* Flow control */
+#define AVE_RXCR_AFEN		BIT(19)	/* MAC address filter */
+#define AVE_RXCR_DRPEN		BIT(18)	/* Drop pause frame */
+#define AVE_RXCR_MPSIZ_MASK	GENMASK(10, 0)
+
+/* AVE_MDIOCTR */
+#define AVE_MDIOCTR_RREQ	BIT(3)	/* Read request */
+#define AVE_MDIOCTR_WREQ	BIT(2)	/* Write request */
+
+/* AVE_MDIOSR */
+#define AVE_MDIOSR_STS		BIT(0)	/* access status */
+
+/* AVE_DESCC */
+#define AVE_DESCC_STATUS_MASK	GENMASK(31, 16)
+#define AVE_DESCC_RD0		BIT(8)	/* Enable Rx descriptor Ring0 */
+#define AVE_DESCC_RDSTP		BIT(4)	/* Pause Rx descriptor */
+#define AVE_DESCC_TD		BIT(0)	/* Enable Tx descriptor */
+
+/* AVE_TXDC */
+#define AVE_TXDC_SIZE		GENMASK(27, 16)	/* Size of Tx descriptor */
+#define AVE_TXDC_ADDR		GENMASK(11, 0)	/* Start address */
+#define AVE_TXDC_ADDR_START	0
+
+/* AVE_RXDC0 */
+#define AVE_RXDC0_SIZE		GENMASK(30, 16)	/* Size of Rx descriptor */
+#define AVE_RXDC0_ADDR		GENMASK(14, 0)	/* Start address */
+#define AVE_RXDC0_ADDR_START	0
+
+/* AVE_IIRQC */
+#define AVE_IIRQC_EN0		BIT(27)	/* Enable interval interrupt Ring0 */
+#define AVE_IIRQC_BSCK		GENMASK(15, 0)	/* Interval count unit */
+
+/* Command status for descriptor */
+#define AVE_STS_OWN		BIT(31)	/* Descriptor ownership */
+#define AVE_STS_INTR		BIT(29)	/* Request for interrupt */
+#define AVE_STS_OK		BIT(27)	/* Normal transmit */
+/* TX */
+#define AVE_STS_NOCSUM		BIT(28)	/* No use HW checksum */
+#define AVE_STS_1ST		BIT(26)	/* Head of buffer chain */
+#define AVE_STS_LAST		BIT(25)	/* Tail of buffer chain */
+#define AVE_STS_OWC		BIT(21)	/* Out of window,Late Collision */
+#define AVE_STS_EC		BIT(20)	/* Excess collision occurred */
+#define AVE_STS_PKTLEN_TX_MASK	GENMASK(15, 0)
+/* RX */
+#define AVE_STS_CSSV		BIT(21)	/* Checksum check performed */
+#define AVE_STS_CSER		BIT(20)	/* Checksum error detected */
+#define AVE_STS_PKTLEN_RX_MASK	GENMASK(10, 0)
+
+/* Packet filter */
+#define AVE_PFMBYTE_MASK0	(GENMASK(31, 8) | GENMASK(5, 0))
+#define AVE_PFMBYTE_MASK1	GENMASK(25, 0)
+#define AVE_PFMBIT_MASK		GENMASK(15, 0)
+
+#define AVE_PF_SIZE		17	/* Number of all packet filter */
+#define AVE_PF_MULTICAST_SIZE	7	/* Number of multicast filter */
+
+#define AVE_PFNUM_FILTER	0	/* No.0 */
+#define AVE_PFNUM_UNICAST	1	/* No.1 */
+#define AVE_PFNUM_BROADCAST	2	/* No.2 */
+#define AVE_PFNUM_MULTICAST	11	/* No.11-17 */
+
+/* NETIF Message control */
+#define AVE_DEFAULT_MSG_ENABLE	(NETIF_MSG_DRV    |	\
+				 NETIF_MSG_PROBE  |	\
+				 NETIF_MSG_LINK   |	\
+				 NETIF_MSG_TIMER  |	\
+				 NETIF_MSG_IFDOWN |	\
+				 NETIF_MSG_IFUP   |	\
+				 NETIF_MSG_RX_ERR |	\
+				 NETIF_MSG_TX_ERR)
+
+/* Parameter for descriptor */
+#define AVE_NR_TXDESC		32	/* Tx descriptor */
+#define AVE_NR_RXDESC		64	/* Rx descriptor */
+
+#define AVE_DESC_OFS_CMDSTS	0
+#define AVE_DESC_OFS_ADDRL	4
+#define AVE_DESC_OFS_ADDRU	8
+
+/* Parameter for ethernet frame */
+#define AVE_MAX_ETHFRAME	1518
+
+/* Parameter for interrupt */
+#define AVE_INTM_COUNT		20
+#define AVE_FORCE_TXINTCNT	1
+
+#define IS_DESC_64BIT(p)	((p)->data->is_desc_64bit)
+
+enum desc_id {
+	AVE_DESCID_RX,
+	AVE_DESCID_TX,
+};
+
+enum desc_state {
+	AVE_DESC_RX_PERMIT,
+	AVE_DESC_RX_SUSPEND,
+	AVE_DESC_START,
+	AVE_DESC_STOP,
+};
+
+struct ave_desc {
+	struct sk_buff	*skbs;
+	dma_addr_t	skbs_dma;
+	size_t		skbs_dmalen;
+};
+
+struct ave_desc_info {
+	u32	ndesc;		/* number of descriptor */
+	u32	daddr;		/* start address of descriptor */
+	u32	proc_idx;	/* index of processing packet */
+	u32	done_idx;	/* index of processed packet */
+	struct ave_desc *desc;	/* skb info related descriptor */
+};
+
+struct ave_soc_data {
+	bool	is_desc_64bit;
+};
+
+struct ave_stats {
+	struct	u64_stats_sync	syncp;
+	u64	packets;
+	u64	bytes;
+	u64	errors;
+	u64	dropped;
+	u64	collisions;
+	u64	fifo_errors;
+};
+
+struct ave_private {
+	void __iomem            *base;
+	int                     irq;
+	int			phy_id;
+	unsigned int		desc_size;
+	u32			msg_enable;
+	struct clk		*clk;
+	struct reset_control	*rst;
+	phy_interface_t		phy_mode;
+	struct phy_device	*phydev;
+	struct mii_bus		*mdio;
+
+	/* stats */
+	struct ave_stats	stats_rx;
+	struct ave_stats	stats_tx;
+
+	/* NAPI support */
+	struct net_device	*ndev;
+	struct napi_struct	napi_rx;
+	struct napi_struct	napi_tx;
+
+	/* descriptor */
+	struct ave_desc_info	rx;
+	struct ave_desc_info	tx;
+
+	/* flow control */
+	int pause_auto;
+	int pause_rx;
+	int pause_tx;
+
+	const struct ave_soc_data *data;
+};
+
+static u32 ave_desc_read(struct net_device *ndev, enum desc_id id, int entry,
+			 int offset)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 addr;
+
+	addr = ((id == AVE_DESCID_TX) ? priv->tx.daddr : priv->rx.daddr)
+		+ entry * priv->desc_size + offset;
+
+	return readl(priv->base + addr);
+}
+
+static u32 ave_desc_read_cmdsts(struct net_device *ndev, enum desc_id id,
+				int entry)
+{
+	return ave_desc_read(ndev, id, entry, AVE_DESC_OFS_CMDSTS);
+}
+
+static void ave_desc_write(struct net_device *ndev, enum desc_id id,
+			   int entry, int offset, u32 val)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 addr;
+
+	addr = ((id == AVE_DESCID_TX) ? priv->tx.daddr : priv->rx.daddr)
+		+ entry * priv->desc_size + offset;
+
+	writel(val, priv->base + addr);
+}
+
+static void ave_desc_write_cmdsts(struct net_device *ndev, enum desc_id id,
+				  int entry, u32 val)
+{
+	ave_desc_write(ndev, id, entry, AVE_DESC_OFS_CMDSTS, val);
+}
+
+static void ave_desc_write_addr(struct net_device *ndev, enum desc_id id,
+				int entry, dma_addr_t paddr)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+
+	ave_desc_write(ndev, id, entry, AVE_DESC_OFS_ADDRL,
+		       lower_32_bits(paddr));
+	if (IS_DESC_64BIT(priv))
+		ave_desc_write(ndev, id,
+			       entry, AVE_DESC_OFS_ADDRU,
+			       upper_32_bits(paddr));
+}
+
+static u32 ave_irq_disable_all(struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 ret;
+
+	ret = readl(priv->base + AVE_GIMR);
+	writel(0, priv->base + AVE_GIMR);
+
+	return ret;
+}
+
+static void ave_irq_restore(struct net_device *ndev, u32 val)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+
+	writel(val, priv->base + AVE_GIMR);
+}
+
+static void ave_irq_enable(struct net_device *ndev, u32 bitflag)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+
+	writel(readl(priv->base + AVE_GIMR) | bitflag, priv->base + AVE_GIMR);
+	writel(bitflag, priv->base + AVE_GISR);
+}
+
+static void ave_hw_write_macaddr(struct net_device *ndev,
+				 const unsigned char *mac_addr,
+				 int reg1, int reg2)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+
+	writel(mac_addr[0] | mac_addr[1] << 8 |
+	       mac_addr[2] << 16 | mac_addr[3] << 24, priv->base + reg1);
+	writel(mac_addr[4] | mac_addr[5] << 8, priv->base + reg2);
+}
+
+static void ave_hw_read_version(struct net_device *ndev, char *buf, int len)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 major, minor, vr;
+
+	vr = readl(priv->base + AVE_VR);
+	major = (vr & GENMASK(15, 8)) >> 8;
+	minor = (vr & GENMASK(7, 0));
+	snprintf(buf, len, "v%u.%u", major, minor);
+}
+
+static void ave_ethtool_get_drvinfo(struct net_device *ndev,
+				    struct ethtool_drvinfo *info)
+{
+	struct device *dev = ndev->dev.parent;
+
+	strlcpy(info->driver, dev->driver->name, sizeof(info->driver));
+	strlcpy(info->bus_info, dev_name(dev), sizeof(info->bus_info));
+	ave_hw_read_version(ndev, info->fw_version, sizeof(info->fw_version));
+}
+
+static u32 ave_ethtool_get_msglevel(struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+
+	return priv->msg_enable;
+}
+
+static void ave_ethtool_set_msglevel(struct net_device *ndev, u32 val)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+
+	priv->msg_enable = val;
+}
+
+static void ave_ethtool_get_wol(struct net_device *ndev,
+				struct ethtool_wolinfo *wol)
+{
+	wol->supported = 0;
+	wol->wolopts   = 0;
+
+	if (ndev->phydev)
+		phy_ethtool_get_wol(ndev->phydev, wol);
+}
+
+static int ave_ethtool_set_wol(struct net_device *ndev,
+			       struct ethtool_wolinfo *wol)
+{
+	int ret;
+
+	if (!ndev->phydev ||
+	    (wol->wolopts & (WAKE_ARP | WAKE_MAGICSECURE)))
+		return -EOPNOTSUPP;
+
+	ret = phy_ethtool_set_wol(ndev->phydev, wol);
+	if (!ret)
+		device_set_wakeup_enable(&ndev->dev, !!wol->wolopts);
+
+	return ret;
+}
+
+static void ave_ethtool_get_pauseparam(struct net_device *ndev,
+				       struct ethtool_pauseparam *pause)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+
+	pause->autoneg  = priv->pause_auto;
+	pause->rx_pause = priv->pause_rx;
+	pause->tx_pause = priv->pause_tx;
+}
+
+static int ave_ethtool_set_pauseparam(struct net_device *ndev,
+				      struct ethtool_pauseparam *pause)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	struct phy_device *phydev = ndev->phydev;
+
+	if (!phydev)
+		return -EINVAL;
+
+	priv->pause_auto = pause->autoneg;
+	priv->pause_rx   = pause->rx_pause;
+	priv->pause_tx   = pause->tx_pause;
+
+	phydev->advertising &= ~(ADVERTISED_Pause | ADVERTISED_Asym_Pause);
+	if (pause->rx_pause)
+		phydev->advertising |= ADVERTISED_Pause | ADVERTISED_Asym_Pause;
+	if (pause->tx_pause)
+		phydev->advertising ^= ADVERTISED_Asym_Pause;
+
+	if (pause->autoneg) {
+		if (netif_running(ndev))
+			phy_start_aneg(phydev);
+	}
+
+	return 0;
+}
+
+static const struct ethtool_ops ave_ethtool_ops = {
+	.get_link_ksettings	= phy_ethtool_get_link_ksettings,
+	.set_link_ksettings	= phy_ethtool_set_link_ksettings,
+	.get_drvinfo		= ave_ethtool_get_drvinfo,
+	.nway_reset		= phy_ethtool_nway_reset,
+	.get_link		= ethtool_op_get_link,
+	.get_msglevel		= ave_ethtool_get_msglevel,
+	.set_msglevel		= ave_ethtool_set_msglevel,
+	.get_wol		= ave_ethtool_get_wol,
+	.set_wol		= ave_ethtool_set_wol,
+	.get_pauseparam         = ave_ethtool_get_pauseparam,
+	.set_pauseparam         = ave_ethtool_set_pauseparam,
+};
+
+static int ave_mdiobus_read(struct mii_bus *bus, int phyid, int regnum)
+{
+	struct net_device *ndev = bus->priv;
+	struct ave_private *priv;
+	u32 mdioctl, mdiosr;
+	int ret;
+
+	priv = netdev_priv(ndev);
+
+	/* write address */
+	writel((phyid << 8) | regnum, priv->base + AVE_MDIOAR);
+
+	/* read request */
+	mdioctl = readl(priv->base + AVE_MDIOCTR);
+	writel((mdioctl | AVE_MDIOCTR_RREQ) & ~AVE_MDIOCTR_WREQ,
+	       priv->base + AVE_MDIOCTR);
+
+	ret = readl_poll_timeout(priv->base + AVE_MDIOSR, mdiosr,
+				 !(mdiosr & AVE_MDIOSR_STS), 20, 2000);
+	if (ret) {
+		netdev_err(ndev, "failed to read (phy:%d reg:%x)\n",
+			   phyid, regnum);
+		return ret;
+	}
+
+	return readl(priv->base + AVE_MDIORDR) & GENMASK(15, 0);
+}
+
+static int ave_mdiobus_write(struct mii_bus *bus, int phyid, int regnum,
+			     u16 val)
+{
+	struct net_device *ndev = bus->priv;
+	struct ave_private *priv;
+	u32 mdioctl, mdiosr;
+	int ret;
+
+	priv = netdev_priv(ndev);
+
+	/* write address */
+	writel((phyid << 8) | regnum, priv->base + AVE_MDIOAR);
+
+	/* write data */
+	writel(val, priv->base + AVE_MDIOWDR);
+
+	/* write request */
+	mdioctl = readl(priv->base + AVE_MDIOCTR);
+	writel((mdioctl | AVE_MDIOCTR_WREQ) & ~AVE_MDIOCTR_RREQ,
+	       priv->base + AVE_MDIOCTR);
+
+	ret = readl_poll_timeout(priv->base + AVE_MDIOSR, mdiosr,
+				 !(mdiosr & AVE_MDIOSR_STS), 20, 2000);
+	if (ret)
+		netdev_err(ndev, "failed to write (phy:%d reg:%x)\n",
+			   phyid, regnum);
+
+	return ret;
+}
+
+static int ave_dma_map(struct net_device *ndev, struct ave_desc *desc,
+		       void *ptr, size_t len, enum dma_data_direction dir,
+		       dma_addr_t *paddr)
+{
+	dma_addr_t map_addr;
+
+	map_addr = dma_map_single(ndev->dev.parent, ptr, len, dir);
+	if (unlikely(dma_mapping_error(ndev->dev.parent, map_addr)))
+		return -ENOMEM;
+
+	desc->skbs_dma = map_addr;
+	desc->skbs_dmalen = len;
+	*paddr = map_addr;
+
+	return 0;
+}
+
+static void ave_dma_unmap(struct net_device *ndev, struct ave_desc *desc,
+			  enum dma_data_direction dir)
+{
+	if (!desc->skbs_dma)
+		return;
+
+	dma_unmap_single(ndev->dev.parent,
+			 desc->skbs_dma, desc->skbs_dmalen, dir);
+	desc->skbs_dma = 0;
+}
+
+/* Prepare Rx descriptor and memory */
+static int ave_rxdesc_prepare(struct net_device *ndev, int entry)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	struct sk_buff *skb;
+	dma_addr_t paddr;
+	int ret;
+
+	skb = priv->rx.desc[entry].skbs;
+	if (!skb) {
+		skb = netdev_alloc_skb_ip_align(ndev,
+						AVE_MAX_ETHFRAME);
+		if (!skb) {
+			netdev_err(ndev, "can't allocate skb for Rx\n");
+			return -ENOMEM;
+		}
+	}
+
+	/* set disable to cmdsts */
+	ave_desc_write_cmdsts(ndev, AVE_DESCID_RX, entry,
+			      AVE_STS_INTR | AVE_STS_OWN);
+
+	/* map Rx buffer
+	 * Rx buffer set to the Rx descriptor has two restrictions:
+	 * - Rx buffer address is 4 byte aligned.
+	 * - Rx buffer begins with 2 byte headroom, and data will be put from
+	 *   (buffer + 2).
+	 * To satisfy this, specify the address to put back the buffer
+	 * pointer advanced by NET_IP_ALIGN by netdev_alloc_skb_ip_align(),
+	 * and expand the map size by NET_IP_ALIGN.
+	 */
+	ret = ave_dma_map(ndev, &priv->rx.desc[entry],
+			  skb->data - NET_IP_ALIGN,
+			  AVE_MAX_ETHFRAME + NET_IP_ALIGN,
+			  DMA_FROM_DEVICE, &paddr);
+	if (ret) {
+		netdev_err(ndev, "can't map skb for Rx\n");
+		dev_kfree_skb_any(skb);
+		return ret;
+	}
+	priv->rx.desc[entry].skbs = skb;
+
+	/* set buffer pointer */
+	ave_desc_write_addr(ndev, AVE_DESCID_RX, entry, paddr);
+
+	/* set enable to cmdsts */
+	ave_desc_write_cmdsts(ndev, AVE_DESCID_RX, entry,
+			      AVE_STS_INTR | AVE_MAX_ETHFRAME);
+
+	return ret;
+}
+
+/* Switch state of descriptor */
+static int ave_desc_switch(struct net_device *ndev, enum desc_state state)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	int ret = 0;
+	u32 val;
+
+	switch (state) {
+	case AVE_DESC_START:
+		writel(AVE_DESCC_TD | AVE_DESCC_RD0, priv->base + AVE_DESCC);
+		break;
+
+	case AVE_DESC_STOP:
+		writel(0, priv->base + AVE_DESCC);
+		if (readl_poll_timeout(priv->base + AVE_DESCC, val, !val,
+				       150, 15000)) {
+			netdev_err(ndev, "can't stop descriptor\n");
+			ret = -EBUSY;
+		}
+		break;
+
+	case AVE_DESC_RX_SUSPEND:
+		val = readl(priv->base + AVE_DESCC);
+		val |= AVE_DESCC_RDSTP;
+		val &= ~AVE_DESCC_STATUS_MASK;
+		writel(val, priv->base + AVE_DESCC);
+		if (readl_poll_timeout(priv->base + AVE_DESCC, val,
+				       val & (AVE_DESCC_RDSTP << 16),
+				       150, 150000)) {
+			netdev_err(ndev, "can't suspend descriptor\n");
+			ret = -EBUSY;
+		}
+		break;
+
+	case AVE_DESC_RX_PERMIT:
+		val = readl(priv->base + AVE_DESCC);
+		val &= ~AVE_DESCC_RDSTP;
+		val &= ~AVE_DESCC_STATUS_MASK;
+		writel(val, priv->base + AVE_DESCC);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int ave_tx_complete(struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 proc_idx, done_idx, ndesc, cmdsts;
+	unsigned int nr_freebuf = 0;
+	unsigned int tx_packets = 0;
+	unsigned int tx_bytes = 0;
+
+	proc_idx = priv->tx.proc_idx;
+	done_idx = priv->tx.done_idx;
+	ndesc    = priv->tx.ndesc;
+
+	/* free pre-stored skb from done_idx to proc_idx */
+	while (proc_idx != done_idx) {
+		cmdsts = ave_desc_read_cmdsts(ndev, AVE_DESCID_TX, done_idx);
+
+		/* do nothing if owner is HW (==1 for Tx) */
+		if (cmdsts & AVE_STS_OWN)
+			break;
+
+		/* check Tx status and updates statistics */
+		if (cmdsts & AVE_STS_OK) {
+			tx_bytes += cmdsts & AVE_STS_PKTLEN_TX_MASK;
+			/* success */
+			if (cmdsts & AVE_STS_LAST)
+				tx_packets++;
+		} else {
+			/* error */
+			if (cmdsts & AVE_STS_LAST) {
+				priv->stats_tx.errors++;
+				if (cmdsts & (AVE_STS_OWC | AVE_STS_EC))
+					priv->stats_tx.collisions++;
+			}
+		}
+
+		/* release skb */
+		if (priv->tx.desc[done_idx].skbs) {
+			ave_dma_unmap(ndev, &priv->tx.desc[done_idx],
+				      DMA_TO_DEVICE);
+			dev_consume_skb_any(priv->tx.desc[done_idx].skbs);
+			priv->tx.desc[done_idx].skbs = NULL;
+			nr_freebuf++;
+		}
+		done_idx = (done_idx + 1) % ndesc;
+	}
+
+	priv->tx.done_idx = done_idx;
+
+	/* update stats */
+	u64_stats_update_begin(&priv->stats_tx.syncp);
+	priv->stats_tx.packets += tx_packets;
+	priv->stats_tx.bytes   += tx_bytes;
+	u64_stats_update_end(&priv->stats_tx.syncp);
+
+	/* wake queue for freeing buffer */
+	if (unlikely(netif_queue_stopped(ndev)) && nr_freebuf)
+		netif_wake_queue(ndev);
+
+	return nr_freebuf;
+}
+
+static int ave_rx_receive(struct net_device *ndev, int num)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	unsigned int rx_packets = 0;
+	unsigned int rx_bytes = 0;
+	u32 proc_idx, done_idx;
+	struct sk_buff *skb;
+	unsigned int pktlen;
+	int restpkt, npkts;
+	u32 ndesc, cmdsts;
+
+	proc_idx = priv->rx.proc_idx;
+	done_idx = priv->rx.done_idx;
+	ndesc    = priv->rx.ndesc;
+	restpkt  = ((proc_idx + ndesc - 1) - done_idx) % ndesc;
+
+	for (npkts = 0; npkts < num; npkts++) {
+		/* we can't receive more packet, so fill desc quickly */
+		if (--restpkt < 0)
+			break;
+
+		cmdsts = ave_desc_read_cmdsts(ndev, AVE_DESCID_RX, proc_idx);
+
+		/* do nothing if owner is HW (==0 for Rx) */
+		if (!(cmdsts & AVE_STS_OWN))
+			break;
+
+		if (!(cmdsts & AVE_STS_OK)) {
+			priv->stats_rx.errors++;
+			proc_idx = (proc_idx + 1) % ndesc;
+			continue;
+		}
+
+		pktlen = cmdsts & AVE_STS_PKTLEN_RX_MASK;
+
+		/* get skbuff for rx */
+		skb = priv->rx.desc[proc_idx].skbs;
+		priv->rx.desc[proc_idx].skbs = NULL;
+
+		ave_dma_unmap(ndev, &priv->rx.desc[proc_idx], DMA_FROM_DEVICE);
+
+		skb->dev = ndev;
+		skb_put(skb, pktlen);
+		skb->protocol = eth_type_trans(skb, ndev);
+
+		if ((cmdsts & AVE_STS_CSSV) && (!(cmdsts & AVE_STS_CSER)))
+			skb->ip_summed = CHECKSUM_UNNECESSARY;
+
+		rx_packets++;
+		rx_bytes += pktlen;
+
+		netif_receive_skb(skb);
+
+		proc_idx = (proc_idx + 1) % ndesc;
+	}
+
+	priv->rx.proc_idx = proc_idx;
+
+	/* update stats */
+	u64_stats_update_begin(&priv->stats_rx.syncp);
+	priv->stats_rx.packets += rx_packets;
+	priv->stats_rx.bytes   += rx_bytes;
+	u64_stats_update_end(&priv->stats_rx.syncp);
+
+	/* refill the Rx buffers */
+	while (proc_idx != done_idx) {
+		if (ave_rxdesc_prepare(ndev, done_idx))
+			break;
+		done_idx = (done_idx + 1) % ndesc;
+	}
+
+	priv->rx.done_idx = done_idx;
+
+	return npkts;
+}
+
+static int ave_napi_poll_rx(struct napi_struct *napi, int budget)
+{
+	struct ave_private *priv;
+	struct net_device *ndev;
+	int num;
+
+	priv = container_of(napi, struct ave_private, napi_rx);
+	ndev = priv->ndev;
+
+	num = ave_rx_receive(ndev, budget);
+	if (num < budget) {
+		napi_complete_done(napi, num);
+
+		/* enable Rx interrupt when NAPI finishes */
+		ave_irq_enable(ndev, AVE_GI_RXIINT);
+	}
+
+	return num;
+}
+
+static int ave_napi_poll_tx(struct napi_struct *napi, int budget)
+{
+	struct ave_private *priv;
+	struct net_device *ndev;
+	int num;
+
+	priv = container_of(napi, struct ave_private, napi_tx);
+	ndev = priv->ndev;
+
+	num = ave_tx_complete(ndev);
+	napi_complete(napi);
+
+	/* enable Tx interrupt when NAPI finishes */
+	ave_irq_enable(ndev, AVE_GI_TX);
+
+	return num;
+}
+
+static void ave_global_reset(struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 val;
+
+	/* set config register */
+	val = AVE_CFGR_FLE | AVE_CFGR_IPFCEN | AVE_CFGR_CHE;
+	if (!phy_interface_mode_is_rgmii(priv->phy_mode))
+		val |= AVE_CFGR_MII;
+	writel(val, priv->base + AVE_CFGR);
+
+	/* reset RMII register */
+	val = readl(priv->base + AVE_RSTCTRL);
+	val &= ~AVE_RSTCTRL_RMIIRST;
+	writel(val, priv->base + AVE_RSTCTRL);
+
+	/* assert reset */
+	writel(AVE_GRR_GRST | AVE_GRR_PHYRST, priv->base + AVE_GRR);
+	msleep(20);
+
+	/* 1st, negate PHY reset only */
+	writel(AVE_GRR_GRST, priv->base + AVE_GRR);
+	msleep(40);
+
+	/* negate reset */
+	writel(0, priv->base + AVE_GRR);
+	msleep(40);
+
+	/* negate RMII register */
+	val = readl(priv->base + AVE_RSTCTRL);
+	val |= AVE_RSTCTRL_RMIIRST;
+	writel(val, priv->base + AVE_RSTCTRL);
+
+	ave_irq_disable_all(ndev);
+}
+
+static void ave_rxfifo_reset(struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 rxcr_org;
+
+	/* save and disable MAC receive op */
+	rxcr_org = readl(priv->base + AVE_RXCR);
+	writel(rxcr_org & (~AVE_RXCR_RXEN), priv->base + AVE_RXCR);
+
+	/* suspend Rx descriptor */
+	ave_desc_switch(ndev, AVE_DESC_RX_SUSPEND);
+
+	/* receive all packets before descriptor starts */
+	ave_rx_receive(ndev, priv->rx.ndesc);
+
+	/* assert reset */
+	writel(AVE_GRR_RXFFR, priv->base + AVE_GRR);
+	usleep_range(40, 50);
+
+	/* negate reset */
+	writel(0, priv->base + AVE_GRR);
+	usleep_range(10, 20);
+
+	/* negate interrupt status */
+	writel(AVE_GI_RXOVF, priv->base + AVE_GISR);
+
+	/* permit descriptor */
+	ave_desc_switch(ndev, AVE_DESC_RX_PERMIT);
+
+	/* restore MAC reccieve op */
+	writel(rxcr_org, priv->base + AVE_RXCR);
+}
+
+static irqreturn_t ave_irq_handler(int irq, void *netdev)
+{
+	struct net_device *ndev = (struct net_device *)netdev;
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 gimr_val, gisr_val;
+
+	gimr_val = ave_irq_disable_all(ndev);
+
+	/* get interrupt status */
+	gisr_val = readl(priv->base + AVE_GISR);
+
+	/* PHY */
+	if (gisr_val & AVE_GI_PHY)
+		writel(AVE_GI_PHY, priv->base + AVE_GISR);
+
+	/* check exceeding packet */
+	if (gisr_val & AVE_GI_RXERR) {
+		writel(AVE_GI_RXERR, priv->base + AVE_GISR);
+		netdev_err(ndev, "receive a packet exceeding frame buffer\n");
+	}
+
+	gisr_val &= gimr_val;
+	if (!gisr_val)
+		goto exit_isr;
+
+	/* RxFIFO overflow */
+	if (gisr_val & AVE_GI_RXOVF) {
+		priv->stats_rx.fifo_errors++;
+		ave_rxfifo_reset(ndev);
+		goto exit_isr;
+	}
+
+	/* Rx drop */
+	if (gisr_val & AVE_GI_RXDROP) {
+		priv->stats_rx.dropped++;
+		writel(AVE_GI_RXDROP, priv->base + AVE_GISR);
+	}
+
+	/* Rx interval */
+	if (gisr_val & AVE_GI_RXIINT) {
+		napi_schedule(&priv->napi_rx);
+		/* still force to disable Rx interrupt until NAPI finishes */
+		gimr_val &= ~AVE_GI_RXIINT;
+	}
+
+	/* Tx completed */
+	if (gisr_val & AVE_GI_TX) {
+		napi_schedule(&priv->napi_tx);
+		/* still force to disable Tx interrupt until NAPI finishes */
+		gimr_val &= ~AVE_GI_TX;
+	}
+
+exit_isr:
+	ave_irq_restore(ndev, gimr_val);
+
+	return IRQ_HANDLED;
+}
+
+static int ave_pfsel_start(struct net_device *ndev, unsigned int entry)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 val;
+
+	if (WARN_ON(entry > AVE_PF_SIZE))
+		return -EINVAL;
+
+	val = readl(priv->base + AVE_PFEN);
+	writel(val | BIT(entry), priv->base + AVE_PFEN);
+
+	return 0;
+}
+
+static int ave_pfsel_stop(struct net_device *ndev, unsigned int entry)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 val;
+
+	if (WARN_ON(entry > AVE_PF_SIZE))
+		return -EINVAL;
+
+	val = readl(priv->base + AVE_PFEN);
+	writel(val & ~BIT(entry), priv->base + AVE_PFEN);
+
+	return 0;
+}
+
+static int ave_pfsel_set_macaddr(struct net_device *ndev,
+				 unsigned int entry,
+				 const unsigned char *mac_addr,
+				 unsigned int set_size)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+
+	if (WARN_ON(entry > AVE_PF_SIZE))
+		return -EINVAL;
+	if (WARN_ON(set_size > 6))
+		return -EINVAL;
+
+	ave_pfsel_stop(ndev, entry);
+
+	/* set MAC address for the filter */
+	ave_hw_write_macaddr(ndev, mac_addr,
+			     AVE_PKTF(entry), AVE_PKTF(entry) + 4);
+
+	/* set byte mask */
+	writel(GENMASK(31, set_size) & AVE_PFMBYTE_MASK0,
+	       priv->base + AVE_PFMBYTE(entry));
+	writel(AVE_PFMBYTE_MASK1, priv->base + AVE_PFMBYTE(entry) + 4);
+
+	/* set bit mask filter */
+	writel(AVE_PFMBIT_MASK, priv->base + AVE_PFMBIT(entry));
+
+	/* set selector to ring 0 */
+	writel(0, priv->base + AVE_PFSEL(entry));
+
+	/* restart filter */
+	ave_pfsel_start(ndev, entry);
+
+	return 0;
+}
+
+static void ave_pfsel_set_promisc(struct net_device *ndev,
+				  unsigned int entry, u32 rxring)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+
+	if (WARN_ON(entry > AVE_PF_SIZE))
+		return;
+
+	ave_pfsel_stop(ndev, entry);
+
+	/* set byte mask */
+	writel(AVE_PFMBYTE_MASK0, priv->base + AVE_PFMBYTE(entry));
+	writel(AVE_PFMBYTE_MASK1, priv->base + AVE_PFMBYTE(entry) + 4);
+
+	/* set bit mask filter */
+	writel(AVE_PFMBIT_MASK, priv->base + AVE_PFMBIT(entry));
+
+	/* set selector to rxring */
+	writel(rxring, priv->base + AVE_PFSEL(entry));
+
+	ave_pfsel_start(ndev, entry);
+}
+
+static void ave_pfsel_init(struct net_device *ndev)
+{
+	unsigned char bcast_mac[ETH_ALEN];
+	int i;
+
+	eth_broadcast_addr(bcast_mac);
+
+	for (i = 0; i < AVE_PF_SIZE; i++)
+		ave_pfsel_stop(ndev, i);
+
+	/* promiscious entry, select ring 0 */
+	ave_pfsel_set_promisc(ndev, AVE_PFNUM_FILTER, 0);
+
+	/* unicast entry */
+	ave_pfsel_set_macaddr(ndev, AVE_PFNUM_UNICAST, ndev->dev_addr, 6);
+
+	/* broadcast entry */
+	ave_pfsel_set_macaddr(ndev, AVE_PFNUM_BROADCAST, bcast_mac, 6);
+}
+
+static void ave_phy_adjust_link(struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	struct phy_device *phydev = ndev->phydev;
+	u32 val, txcr, rxcr, rxcr_org;
+	u16 rmt_adv = 0, lcl_adv = 0;
+	u8 cap;
+
+	/* set RGMII speed */
+	val = readl(priv->base + AVE_TXCR);
+	val &= ~(AVE_TXCR_TXSPD_100 | AVE_TXCR_TXSPD_1G);
+
+	if (phy_interface_is_rgmii(phydev) && phydev->speed == SPEED_1000)
+		val |= AVE_TXCR_TXSPD_1G;
+	else if (phydev->speed == SPEED_100)
+		val |= AVE_TXCR_TXSPD_100;
+
+	writel(val, priv->base + AVE_TXCR);
+
+	/* set RMII speed (100M/10M only) */
+	if (!phy_interface_is_rgmii(phydev)) {
+		val = readl(priv->base + AVE_LINKSEL);
+		if (phydev->speed == SPEED_10)
+			val &= ~AVE_LINKSEL_100M;
+		else
+			val |= AVE_LINKSEL_100M;
+		writel(val, priv->base + AVE_LINKSEL);
+	}
+
+	/* check current RXCR/TXCR */
+	rxcr = readl(priv->base + AVE_RXCR);
+	txcr = readl(priv->base + AVE_TXCR);
+	rxcr_org = rxcr;
+
+	if (phydev->duplex) {
+		rxcr |= AVE_RXCR_FDUPEN;
+
+		if (phydev->pause)
+			rmt_adv |= LPA_PAUSE_CAP;
+		if (phydev->asym_pause)
+			rmt_adv |= LPA_PAUSE_ASYM;
+		if (phydev->advertising & ADVERTISED_Pause)
+			lcl_adv |= ADVERTISE_PAUSE_CAP;
+		if (phydev->advertising & ADVERTISED_Asym_Pause)
+			lcl_adv |= ADVERTISE_PAUSE_ASYM;
+
+		cap = mii_resolve_flowctrl_fdx(lcl_adv, rmt_adv);
+		if (cap & FLOW_CTRL_TX)
+			txcr |= AVE_TXCR_FLOCTR;
+		else
+			txcr &= ~AVE_TXCR_FLOCTR;
+		if (cap & FLOW_CTRL_RX)
+			rxcr |= AVE_RXCR_FLOCTR;
+		else
+			rxcr &= ~AVE_RXCR_FLOCTR;
+	} else {
+		rxcr &= ~AVE_RXCR_FDUPEN;
+		rxcr &= ~AVE_RXCR_FLOCTR;
+		txcr &= ~AVE_TXCR_FLOCTR;
+	}
+
+	if (rxcr_org != rxcr) {
+		/* disable Rx mac */
+		writel(rxcr & ~AVE_RXCR_RXEN, priv->base + AVE_RXCR);
+		/* change and enable TX/Rx mac */
+		writel(txcr, priv->base + AVE_TXCR);
+		writel(rxcr, priv->base + AVE_RXCR);
+	}
+
+	phy_print_status(phydev);
+}
+
+static void ave_macaddr_init(struct net_device *ndev)
+{
+	ave_hw_write_macaddr(ndev, ndev->dev_addr, AVE_RXMAC1R, AVE_RXMAC2R);
+
+	/* pfsel unicast entry */
+	ave_pfsel_set_macaddr(ndev, AVE_PFNUM_UNICAST, ndev->dev_addr, 6);
+}
+
+static int ave_init(struct net_device *ndev)
+{
+	struct ethtool_wolinfo wol = { .cmd = ETHTOOL_GWOL };
+	struct ave_private *priv = netdev_priv(ndev);
+	struct device *dev = ndev->dev.parent;
+	struct device_node *np = dev->of_node;
+	struct device_node *mdio_np;
+	struct phy_device *phydev;
+	int ret;
+
+	/* enable clk because of hw access until ndo_open */
+	ret = clk_prepare_enable(priv->clk);
+	if (ret) {
+		dev_err(dev, "can't enable clock\n");
+		return ret;
+	}
+	ret = reset_control_deassert(priv->rst);
+	if (ret) {
+		dev_err(dev, "can't deassert reset\n");
+		goto out_clk_disable;
+	}
+
+	ave_global_reset(ndev);
+
+	mdio_np = of_get_child_by_name(np, "mdio");
+	if (!mdio_np) {
+		dev_err(dev, "mdio node not found\n");
+		ret = -EINVAL;
+		goto out_reset_assert;
+	}
+	ret = of_mdiobus_register(priv->mdio, mdio_np);
+	of_node_put(mdio_np);
+	if (ret) {
+		dev_err(dev, "failed to register mdiobus\n");
+		goto out_reset_assert;
+	}
+
+	phydev = of_phy_get_and_connect(ndev, np, ave_phy_adjust_link);
+	if (!phydev) {
+		dev_err(dev, "could not attach to PHY\n");
+		ret = -ENODEV;
+		goto out_mdio_unregister;
+	}
+
+	priv->phydev = phydev;
+
+	phy_ethtool_get_wol(phydev, &wol);
+	device_set_wakeup_capable(&ndev->dev, !!wol.supported);
+
+	if (!phy_interface_is_rgmii(phydev)) {
+		phydev->supported &= ~PHY_GBIT_FEATURES;
+		phydev->supported |= PHY_BASIC_FEATURES;
+	}
+	phydev->supported |= SUPPORTED_Pause | SUPPORTED_Asym_Pause;
+
+	phy_attached_info(phydev);
+
+	return 0;
+
+out_mdio_unregister:
+	mdiobus_unregister(priv->mdio);
+out_reset_assert:
+	reset_control_assert(priv->rst);
+out_clk_disable:
+	clk_disable_unprepare(priv->clk);
+
+	return ret;
+}
+
+static void ave_uninit(struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+
+	phy_disconnect(priv->phydev);
+	mdiobus_unregister(priv->mdio);
+
+	/* disable clk because of hw access after ndo_stop */
+	reset_control_assert(priv->rst);
+	clk_disable_unprepare(priv->clk);
+}
+
+static int ave_open(struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	int entry;
+	int ret;
+	u32 val;
+
+	ret = request_irq(priv->irq, ave_irq_handler, IRQF_SHARED, ndev->name,
+			  ndev);
+	if (ret)
+		return ret;
+
+	priv->tx.desc = kcalloc(priv->tx.ndesc, sizeof(*priv->tx.desc),
+				GFP_KERNEL);
+	if (!priv->tx.desc) {
+		ret = -ENOMEM;
+		goto out_free_irq;
+	}
+
+	priv->rx.desc = kcalloc(priv->rx.ndesc, sizeof(*priv->rx.desc),
+				GFP_KERNEL);
+	if (!priv->rx.desc) {
+		kfree(priv->tx.desc);
+		ret = -ENOMEM;
+		goto out_free_irq;
+	}
+
+	/* initialize Tx work and descriptor */
+	priv->tx.proc_idx = 0;
+	priv->tx.done_idx = 0;
+	for (entry = 0; entry < priv->tx.ndesc; entry++) {
+		ave_desc_write_cmdsts(ndev, AVE_DESCID_TX, entry, 0);
+		ave_desc_write_addr(ndev, AVE_DESCID_TX, entry, 0);
+	}
+	writel(AVE_TXDC_ADDR_START |
+	       (((priv->tx.ndesc * priv->desc_size) << 16) & AVE_TXDC_SIZE),
+	       priv->base + AVE_TXDC);
+
+	/* initialize Rx work and descriptor */
+	priv->rx.proc_idx = 0;
+	priv->rx.done_idx = 0;
+	for (entry = 0; entry < priv->rx.ndesc; entry++) {
+		if (ave_rxdesc_prepare(ndev, entry))
+			break;
+	}
+	writel(AVE_RXDC0_ADDR_START |
+	       (((priv->rx.ndesc * priv->desc_size) << 16) & AVE_RXDC0_SIZE),
+	       priv->base + AVE_RXDC0);
+
+	ave_desc_switch(ndev, AVE_DESC_START);
+
+	ave_pfsel_init(ndev);
+	ave_macaddr_init(ndev);
+
+	/* set Rx configuration */
+	/* full duplex, enable pause drop, enalbe flow control */
+	val = AVE_RXCR_RXEN | AVE_RXCR_FDUPEN | AVE_RXCR_DRPEN |
+		AVE_RXCR_FLOCTR | (AVE_MAX_ETHFRAME & AVE_RXCR_MPSIZ_MASK);
+	writel(val, priv->base + AVE_RXCR);
+
+	/* set Tx configuration */
+	/* enable flow control, disable loopback */
+	writel(AVE_TXCR_FLOCTR, priv->base + AVE_TXCR);
+
+	/* enable timer, clear EN,INTM, and mask interval unit(BSCK) */
+	val = readl(priv->base + AVE_IIRQC) & AVE_IIRQC_BSCK;
+	val |= AVE_IIRQC_EN0 | (AVE_INTM_COUNT << 16);
+	writel(val, priv->base + AVE_IIRQC);
+
+	val = AVE_GI_RXIINT | AVE_GI_RXOVF | AVE_GI_TX;
+	ave_irq_restore(ndev, val);
+
+	napi_enable(&priv->napi_rx);
+	napi_enable(&priv->napi_tx);
+
+	phy_start(ndev->phydev);
+	phy_start_aneg(ndev->phydev);
+	netif_start_queue(ndev);
+
+	return 0;
+
+out_free_irq:
+	disable_irq(priv->irq);
+	free_irq(priv->irq, ndev);
+
+	return ret;
+}
+
+static int ave_stop(struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	int entry;
+
+	ave_irq_disable_all(ndev);
+	disable_irq(priv->irq);
+	free_irq(priv->irq, ndev);
+
+	netif_tx_disable(ndev);
+	phy_stop(ndev->phydev);
+	napi_disable(&priv->napi_tx);
+	napi_disable(&priv->napi_rx);
+
+	ave_desc_switch(ndev, AVE_DESC_STOP);
+
+	/* free Tx buffer */
+	for (entry = 0; entry < priv->tx.ndesc; entry++) {
+		if (!priv->tx.desc[entry].skbs)
+			continue;
+
+		ave_dma_unmap(ndev, &priv->tx.desc[entry], DMA_TO_DEVICE);
+		dev_kfree_skb_any(priv->tx.desc[entry].skbs);
+		priv->tx.desc[entry].skbs = NULL;
+	}
+	priv->tx.proc_idx = 0;
+	priv->tx.done_idx = 0;
+
+	/* free Rx buffer */
+	for (entry = 0; entry < priv->rx.ndesc; entry++) {
+		if (!priv->rx.desc[entry].skbs)
+			continue;
+
+		ave_dma_unmap(ndev, &priv->rx.desc[entry], DMA_FROM_DEVICE);
+		dev_kfree_skb_any(priv->rx.desc[entry].skbs);
+		priv->rx.desc[entry].skbs = NULL;
+	}
+	priv->rx.proc_idx = 0;
+	priv->rx.done_idx = 0;
+
+	kfree(priv->tx.desc);
+	kfree(priv->rx.desc);
+
+	return 0;
+}
+
+static int ave_start_xmit(struct sk_buff *skb, struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	u32 proc_idx, done_idx, ndesc, cmdsts;
+	int ret, freepkt;
+	dma_addr_t paddr;
+
+	proc_idx = priv->tx.proc_idx;
+	done_idx = priv->tx.done_idx;
+	ndesc = priv->tx.ndesc;
+	freepkt = ((done_idx + ndesc - 1) - proc_idx) % ndesc;
+
+	/* stop queue when not enough entry */
+	if (unlikely(freepkt < 1)) {
+		netif_stop_queue(ndev);
+		return NETDEV_TX_BUSY;
+	}
+
+	/* add padding for short packet */
+	if (skb_put_padto(skb, ETH_ZLEN)) {
+		priv->stats_tx.dropped++;
+		return NETDEV_TX_OK;
+	}
+
+	/* map Tx buffer
+	 * Tx buffer set to the Tx descriptor doesn't have any restriction.
+	 */
+	ret = ave_dma_map(ndev, &priv->tx.desc[proc_idx],
+			  skb->data, skb->len, DMA_TO_DEVICE, &paddr);
+	if (ret) {
+		dev_kfree_skb_any(skb);
+		priv->stats_tx.dropped++;
+		return NETDEV_TX_OK;
+	}
+
+	priv->tx.desc[proc_idx].skbs = skb;
+
+	ave_desc_write_addr(ndev, AVE_DESCID_TX, proc_idx, paddr);
+
+	cmdsts = AVE_STS_OWN | AVE_STS_1ST | AVE_STS_LAST |
+		(skb->len & AVE_STS_PKTLEN_TX_MASK);
+
+	/* set interrupt per AVE_FORCE_TXINTCNT or when queue is stopped */
+	if (!(proc_idx % AVE_FORCE_TXINTCNT) || netif_queue_stopped(ndev))
+		cmdsts |= AVE_STS_INTR;
+
+	/* disable checksum calculation when skb doesn't calurate checksum */
+	if (skb->ip_summed == CHECKSUM_NONE ||
+	    skb->ip_summed == CHECKSUM_UNNECESSARY)
+		cmdsts |= AVE_STS_NOCSUM;
+
+	ave_desc_write_cmdsts(ndev, AVE_DESCID_TX, proc_idx, cmdsts);
+
+	priv->tx.proc_idx = (proc_idx + 1) % ndesc;
+
+	return NETDEV_TX_OK;
+}
+
+static int ave_ioctl(struct net_device *ndev, struct ifreq *ifr, int cmd)
+{
+	return phy_mii_ioctl(ndev->phydev, ifr, cmd);
+}
+
+static const u8 v4multi_macadr[] = { 0x01, 0x00, 0x00, 0x00, 0x00, 0x00 };
+static const u8 v6multi_macadr[] = { 0x33, 0x00, 0x00, 0x00, 0x00, 0x00 };
+
+static void ave_set_rx_mode(struct net_device *ndev)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	struct netdev_hw_addr *hw_adr;
+	int count, mc_cnt;
+	u32 val;
+
+	/* MAC addr filter enable for promiscious mode */
+	mc_cnt = netdev_mc_count(ndev);
+	val = readl(priv->base + AVE_RXCR);
+	if (ndev->flags & IFF_PROMISC || !mc_cnt)
+		val &= ~AVE_RXCR_AFEN;
+	else
+		val |= AVE_RXCR_AFEN;
+	writel(val, priv->base + AVE_RXCR);
+
+	/* set all multicast address */
+	if ((ndev->flags & IFF_ALLMULTI) || mc_cnt > AVE_PF_MULTICAST_SIZE) {
+		ave_pfsel_set_macaddr(ndev, AVE_PFNUM_MULTICAST,
+				      v4multi_macadr, 1);
+		ave_pfsel_set_macaddr(ndev, AVE_PFNUM_MULTICAST + 1,
+				      v6multi_macadr, 1);
+	} else {
+		/* stop all multicast filter */
+		for (count = 0; count < AVE_PF_MULTICAST_SIZE; count++)
+			ave_pfsel_stop(ndev, AVE_PFNUM_MULTICAST + count);
+
+		/* set multicast addresses */
+		count = 0;
+		netdev_for_each_mc_addr(hw_adr, ndev) {
+			if (count == mc_cnt)
+				break;
+			ave_pfsel_set_macaddr(ndev, AVE_PFNUM_MULTICAST + count,
+					      hw_adr->addr, 6);
+			count++;
+		}
+	}
+}
+
+static void ave_get_stats64(struct net_device *ndev,
+			    struct rtnl_link_stats64 *stats)
+{
+	struct ave_private *priv = netdev_priv(ndev);
+	unsigned int start;
+
+	do {
+		start = u64_stats_fetch_begin_irq(&priv->stats_rx.syncp);
+		stats->rx_packets = priv->stats_rx.packets;
+		stats->rx_bytes	  = priv->stats_rx.bytes;
+	} while (u64_stats_fetch_retry_irq(&priv->stats_rx.syncp, start));
+
+	do {
+		start = u64_stats_fetch_begin_irq(&priv->stats_tx.syncp);
+		stats->tx_packets = priv->stats_tx.packets;
+		stats->tx_bytes	  = priv->stats_tx.bytes;
+	} while (u64_stats_fetch_retry_irq(&priv->stats_tx.syncp, start));
+
+	stats->rx_errors      = priv->stats_rx.errors;
+	stats->tx_errors      = priv->stats_tx.errors;
+	stats->rx_dropped     = priv->stats_rx.dropped;
+	stats->tx_dropped     = priv->stats_tx.dropped;
+	stats->rx_fifo_errors = priv->stats_rx.fifo_errors;
+	stats->collisions     = priv->stats_tx.collisions;
+}
+
+static int ave_set_mac_address(struct net_device *ndev, void *p)
+{
+	int ret = eth_mac_addr(ndev, p);
+
+	if (ret)
+		return ret;
+
+	ave_macaddr_init(ndev);
+
+	return 0;
+}
+
+static const struct net_device_ops ave_netdev_ops = {
+	.ndo_init		= ave_init,
+	.ndo_uninit		= ave_uninit,
+	.ndo_open		= ave_open,
+	.ndo_stop		= ave_stop,
+	.ndo_start_xmit		= ave_start_xmit,
+	.ndo_do_ioctl		= ave_ioctl,
+	.ndo_set_rx_mode	= ave_set_rx_mode,
+	.ndo_get_stats64	= ave_get_stats64,
+	.ndo_set_mac_address	= ave_set_mac_address,
+};
+
+static int ave_probe(struct platform_device *pdev)
+{
+	const struct ave_soc_data *data;
+	struct device *dev = &pdev->dev;
+	char buf[ETHTOOL_FWVERS_LEN];
+	phy_interface_t phy_mode;
+	struct ave_private *priv;
+	struct net_device *ndev;
+	struct device_node *np;
+	struct resource	*res;
+	const void *mac_addr;
+	void __iomem *base;
+	u64 dma_mask;
+	int irq, ret;
+	u32 ave_id;
+
+	data = of_device_get_match_data(dev);
+	if (WARN_ON(!data))
+		return -EINVAL;
+
+	np = dev->of_node;
+	phy_mode = of_get_phy_mode(np);
+	if (phy_mode < 0) {
+		dev_err(dev, "phy-mode not found\n");
+		return -EINVAL;
+	}
+	if ((!phy_interface_mode_is_rgmii(phy_mode)) &&
+	    phy_mode != PHY_INTERFACE_MODE_RMII &&
+	    phy_mode != PHY_INTERFACE_MODE_MII) {
+		dev_err(dev, "phy-mode is invalid\n");
+		return -EINVAL;
+	}
+
+	irq = platform_get_irq(pdev, 0);
+	if (irq < 0) {
+		dev_err(dev, "IRQ not found\n");
+		return irq;
+	}
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	base = devm_ioremap_resource(dev, res);
+	if (IS_ERR(base))
+		return PTR_ERR(base);
+
+	ndev = alloc_etherdev(sizeof(struct ave_private));
+	if (!ndev) {
+		dev_err(dev, "can't allocate ethernet device\n");
+		return -ENOMEM;
+	}
+
+	ndev->netdev_ops = &ave_netdev_ops;
+	ndev->ethtool_ops = &ave_ethtool_ops;
+	SET_NETDEV_DEV(ndev, dev);
+
+	ndev->features    |= (NETIF_F_IP_CSUM | NETIF_F_RXCSUM);
+	ndev->hw_features |= (NETIF_F_IP_CSUM | NETIF_F_RXCSUM);
+
+	ndev->max_mtu = AVE_MAX_ETHFRAME - (ETH_HLEN + ETH_FCS_LEN);
+
+	mac_addr = of_get_mac_address(np);
+	if (mac_addr)
+		ether_addr_copy(ndev->dev_addr, mac_addr);
+
+	/* if the mac address is invalid, use random mac address */
+	if (!is_valid_ether_addr(ndev->dev_addr)) {
+		eth_hw_addr_random(ndev);
+		dev_warn(dev, "Using random MAC address: %pM\n",
+			 ndev->dev_addr);
+	}
+
+	priv = netdev_priv(ndev);
+	priv->base = base;
+	priv->irq = irq;
+	priv->ndev = ndev;
+	priv->msg_enable = netif_msg_init(-1, AVE_DEFAULT_MSG_ENABLE);
+	priv->phy_mode = phy_mode;
+	priv->data = data;
+
+	if (IS_DESC_64BIT(priv)) {
+		priv->desc_size = AVE_DESC_SIZE_64;
+		priv->tx.daddr  = AVE_TXDM_64;
+		priv->rx.daddr  = AVE_RXDM_64;
+		dma_mask = DMA_BIT_MASK(64);
+	} else {
+		priv->desc_size = AVE_DESC_SIZE_32;
+		priv->tx.daddr  = AVE_TXDM_32;
+		priv->rx.daddr  = AVE_RXDM_32;
+		dma_mask = DMA_BIT_MASK(32);
+	}
+	ret = dma_set_mask(dev, dma_mask);
+	if (ret)
+		goto out_free_netdev;
+
+	priv->tx.ndesc = AVE_NR_TXDESC;
+	priv->rx.ndesc = AVE_NR_RXDESC;
+
+	u64_stats_init(&priv->stats_tx.syncp);
+	u64_stats_init(&priv->stats_rx.syncp);
+
+	priv->clk = devm_clk_get(dev, NULL);
+	if (IS_ERR(priv->clk)) {
+		ret = PTR_ERR(priv->clk);
+		goto out_free_netdev;
+	}
+
+	priv->rst = devm_reset_control_get_optional_shared(dev, NULL);
+	if (IS_ERR(priv->rst)) {
+		ret = PTR_ERR(priv->rst);
+		goto out_free_netdev;
+	}
+
+	priv->mdio = devm_mdiobus_alloc(dev);
+	if (!priv->mdio) {
+		ret = -ENOMEM;
+		goto out_free_netdev;
+	}
+	priv->mdio->priv = ndev;
+	priv->mdio->parent = dev;
+	priv->mdio->read = ave_mdiobus_read;
+	priv->mdio->write = ave_mdiobus_write;
+	priv->mdio->name = "uniphier-mdio";
+	snprintf(priv->mdio->id, MII_BUS_ID_SIZE, "%s-%x",
+		 pdev->name, pdev->id);
+
+	/* Register as a NAPI supported driver */
+	netif_napi_add(ndev, &priv->napi_rx, ave_napi_poll_rx, priv->rx.ndesc);
+	netif_tx_napi_add(ndev, &priv->napi_tx, ave_napi_poll_tx,
+			  priv->tx.ndesc);
+
+	platform_set_drvdata(pdev, ndev);
+
+	ret = register_netdev(ndev);
+	if (ret) {
+		dev_err(dev, "failed to register netdevice\n");
+		goto out_del_napi;
+	}
+
+	/* get ID and version */
+	ave_id = readl(priv->base + AVE_IDR);
+	ave_hw_read_version(ndev, buf, sizeof(buf));
+
+	dev_info(dev, "Socionext %c%c%c%c Ethernet IP %s (irq=%d, phy=%s)\n",
+		 (ave_id >> 24) & 0xff, (ave_id >> 16) & 0xff,
+		 (ave_id >> 8) & 0xff, (ave_id >> 0) & 0xff,
+		 buf, priv->irq, phy_modes(phy_mode));
+
+	return 0;
+
+out_del_napi:
+	netif_napi_del(&priv->napi_rx);
+	netif_napi_del(&priv->napi_tx);
+out_free_netdev:
+	free_netdev(ndev);
+
+	return ret;
+}
+
+static int ave_remove(struct platform_device *pdev)
+{
+	struct net_device *ndev = platform_get_drvdata(pdev);
+	struct ave_private *priv = netdev_priv(ndev);
+
+	unregister_netdev(ndev);
+	netif_napi_del(&priv->napi_rx);
+	netif_napi_del(&priv->napi_tx);
+	free_netdev(ndev);
+
+	return 0;
+}
+
+static const struct ave_soc_data ave_pro4_data = {
+	.is_desc_64bit = false,
+};
+
+static const struct ave_soc_data ave_pxs2_data = {
+	.is_desc_64bit = false,
+};
+
+static const struct ave_soc_data ave_ld11_data = {
+	.is_desc_64bit = false,
+};
+
+static const struct ave_soc_data ave_ld20_data = {
+	.is_desc_64bit = true,
+};
+
+static const struct of_device_id of_ave_match[] = {
+	{
+		.compatible = "socionext,uniphier-pro4-ave4",
+		.data = &ave_pro4_data,
+	},
+	{
+		.compatible = "socionext,uniphier-pxs2-ave4",
+		.data = &ave_pxs2_data,
+	},
+	{
+		.compatible = "socionext,uniphier-ld11-ave4",
+		.data = &ave_ld11_data,
+	},
+	{
+		.compatible = "socionext,uniphier-ld20-ave4",
+		.data = &ave_ld20_data,
+	},
+	{ /* Sentinel */ }
+};
+MODULE_DEVICE_TABLE(of, of_ave_match);
+
+static struct platform_driver ave_driver = {
+	.probe  = ave_probe,
+	.remove = ave_remove,
+	.driver	= {
+		.name = "ave",
+		.of_match_table	= of_ave_match,
+	},
+};
+module_platform_driver(ave_driver);
+
+MODULE_DESCRIPTION("Socionext UniPhier AVE ethernet driver");
+MODULE_LICENSE("GPL v2");
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH] Staging: ipx: fixed several no space before tabs coding style issues
From: Greg KH @ 2017-12-28  7:17 UTC (permalink / raw)
  To: Jianshen Liu; +Cc: devel, netdev, linux-kernel
In-Reply-To: <1514409944-27794-1-git-send-email-ljishen@gmail.com>

On Wed, Dec 27, 2017 at 09:25:44PM +0000, Jianshen Liu wrote:
> Fixed several coding style warnings of "please, no space before tabs".
> 
> Signed-off-by: Jianshen Liu <ljishen@gmail.com>
> ---
>  drivers/staging/ipx/af_ipx.c    | 56 ++++++++++++++++++++---------------------
>  drivers/staging/ipx/ipx_proc.c  |  2 +-
>  drivers/staging/ipx/ipx_route.c |  6 ++---
>  3 files changed, 32 insertions(+), 32 deletions(-)

Please read drivers/staging/ipx/TODO

^ permalink raw reply

* Re: [PATCH net-next 1/6] phy: add 2.5G SGMII mode to the phy_mode enum
From: Andrew Lunn @ 2017-12-28  7:20 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kishon, jason, sebastian.hesselbarth, gregory.clement,
	linux, mw, stefanc, ymarkman, thomas.petazzoni, miquel.raynal,
	nadavh, netdev, linux-arm-kernel, linux-kernel
In-Reply-To: <20171227221446.18459-2-antoine.tenart@free-electrons.com>

On Wed, Dec 27, 2017 at 11:14:41PM +0100, Antoine Tenart wrote:
> This patch adds one more generic PHY mode to the phy_mode enum, to allow
> configuring generic PHYs to the 2.5G SGMII mode by using the set_mode
> callback.
> 
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> ---
>  include/linux/phy/phy.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/phy/phy.h b/include/linux/phy/phy.h
> index 4f8423a948d5..70459a28f3a1 100644
> --- a/include/linux/phy/phy.h
> +++ b/include/linux/phy/phy.h
> @@ -28,6 +28,7 @@ enum phy_mode {
>  	PHY_MODE_USB_DEVICE,
>  	PHY_MODE_USB_OTG,
>  	PHY_MODE_SGMII,
> +	PHY_MODE_SGMII_2_5G,
>  	PHY_MODE_10GKR,
>  	PHY_MODE_UFS_HS_A,
>  	PHY_MODE_UFS_HS_B,

Hi Antoine

There was a discussion maybe last month about adding 2.5G SGMII. I
would prefer 2500SGMII. Putting the number first makes it uniform with
the other defines, 1000BASEX, 25000BASEX, 10GKR.

    Andrew

^ permalink raw reply

* Re: [PATCH net-next 5/6] arm64: dts: marvell: mcbin: enable the fourth network interface
From: Andrew Lunn @ 2017-12-28  7:46 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: thomas.petazzoni, ymarkman, jason, netdev, Antoine Tenart,
	linux-kernel, kishon, nadavh, miquel.raynal, gregory.clement,
	stefanc, mw, davem, linux-arm-kernel, sebastian.hesselbarth
In-Reply-To: <20171227222401.GT10595@n2100.armlinux.org.uk>

On Wed, Dec 27, 2017 at 10:24:01PM +0000, Russell King - ARM Linux wrote:
> On Wed, Dec 27, 2017 at 11:14:45PM +0100, Antoine Tenart wrote:
> > This patch enables the fourth network interface on the Marvell
> > Macchiatobin. It is configured in the 2500Base-X PHY mode.
> > 
> > Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> > ---
> >  arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> > index b3350827ee55..c51efd511324 100644
> > --- a/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> > +++ b/arch/arm64/boot/dts/marvell/armada-8040-mcbin.dts
> > @@ -279,6 +279,14 @@
> >  	phys = <&cps_comphy0 1>;
> >  };
> >  
> > +&cps_eth2 {
> > +	/* CPS Lane 5 */
> > +	status = "okay";
> > +	phy-mode = "2500base-x";
> > +	/* Generic PHY, providing serdes lanes */
> > +	phys = <&cps_comphy5 2>;
> > +};
> > +
> 
> This is wrong.  This lane is connected to a SFP cage which can support
> more than just 2500base-X.  Tying it in this way to 2500base-X means
> that this port does not support conenctions at 1000base-X, despite
> that's one of the most popular and more standardised speeds.
> 

Hi Antoine

I agree with Russell here. SFP modules are hot pluggable, and support
a range of interface modes. You need to query what the SFP module is
in order to know how to configure the SERDES interface. The phylink
infrastructure does that for you.

   Andrew

^ permalink raw reply

* Re: [RFC PATCH bpf-next v2 4/4] error-injection: Support fault injection framework
From: Masami Hiramatsu @ 2017-12-28  7:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Josef Bacik, rostedt, mingo, davem, netdev,
	linux-kernel, ast, kernel-team, daniel, linux-btrfs, darrick.wong,
	Josef Bacik, Akinobu Mita
In-Reply-To: <126a806d-e913-d243-6678-56fc3cff5e9e@fb.com>

On Wed, 27 Dec 2017 19:49:28 -0800
Alexei Starovoitov <ast@fb.com> wrote:

> On 12/27/17 5:38 PM, Masami Hiramatsu wrote:
> > On Wed, 27 Dec 2017 14:49:46 -0800
> > Alexei Starovoitov <ast@fb.com> wrote:
> >
> >> On 12/27/17 12:09 AM, Masami Hiramatsu wrote:
> >>> On Tue, 26 Dec 2017 18:12:56 -0800
> >>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>>> On Tue, Dec 26, 2017 at 04:48:25PM +0900, Masami Hiramatsu wrote:
> >>>>> Support in-kernel fault-injection framework via debugfs.
> >>>>> This allows you to inject a conditional error to specified
> >>>>> function using debugfs interfaces.
> >>>>>
> >>>>> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> >>>>> ---
> >>>>>  Documentation/fault-injection/fault-injection.txt |    5 +
> >>>>>  kernel/Makefile                                   |    1
> >>>>>  kernel/fail_function.c                            |  169 +++++++++++++++++++++
> >>>>>  lib/Kconfig.debug                                 |   10 +
> >>>>>  4 files changed, 185 insertions(+)
> >>>>>  create mode 100644 kernel/fail_function.c
> >>>>>
> >>>>> diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt
> >>>>> index 918972babcd8..6243a588dd71 100644
> >>>>> --- a/Documentation/fault-injection/fault-injection.txt
> >>>>> +++ b/Documentation/fault-injection/fault-injection.txt
> >>>>> @@ -30,6 +30,11 @@ o fail_mmc_request
> >>>>>    injects MMC data errors on devices permitted by setting
> >>>>>    debugfs entries under /sys/kernel/debug/mmc0/fail_mmc_request
> >>>>>
> >>>>> +o fail_function
> >>>>> +
> >>>>> +  injects error return on specific functions by setting debugfs entries
> >>>>> +  under /sys/kernel/debug/fail_function. No boot option supported.
> >>>>
> >>>> I like it.
> >>>> Could you document it a bit better?
> >>>
> >>> Yes, I will do in next series.
> >>>
> >>>> In particular retval is configurable, but without an example no one
> >>>> will be able to figure out how to use it.
> >>>
> >>> Ah, right. BTW, as I pointed in the covermail, should we store the
> >>> expected error value range into the injectable list? e.g.
> >>>
> >>> ALLOW_ERROR_INJECTION(open_ctree, -1, -MAX_ERRNO)
> >>>
> >>> And provide APIs to check/get it.
> >>
> >> I'm afraid such check would be too costly.
> >> Right now we have only two functions marked but I expect hundreds more
> >> will be added in the near future as soon as developers realize the
> >> potential of such error injection.
> >> All of ALLOW_ERROR_INJECTION marks add 8 byte overhead each to .data.
> >> Multiple by 1k and we have 8k of data spent on marks.
> >> If we add max/min range marks that doubles it for very little use.
> >> I think marking function only is enough.
> >
> > Sorry, I don't think so.
> > Even if it takes 16 bytes more for each points, I don't think it is
> > any overhead for machines in these days. Even if so, we can provide
> > a kconfig to reduce it.
> > I mean, we are living in GB-order memory are, and it will be bigger
> > in the future. Why we have to worry about hundreds of 16bytes memory
> > pieces? It will take a few KB, and even if we mark thousands of
> > functions, it never reaches 1MB, in GB memory pool. :)
> >
> > Of course, for many small-footprint embedded devices (like having
> > less than 128MB memory), this feature can be a overhead. But they
> > can cut off the table by kconfig.
> 
> I still disagree on wasting 16-byte * num_of_funcs of .data here.
> The trade-off of usability vs memory just not worth it. Sorry.
> Please focus on testing your changes instead.

Then what happen if the user set invalid retval to those functions?
even if we limit the injectable functions, it can cause a problem,

for example, 

 obj = func_return_object();
 if (!obj) {
    handling_error...;
 }
 obj->field = x;

In this case, obviously func_return_object() must return NULL if there is
an error, not -ENOMEM. But without the correct retval information, how would
you check the BPF code doesn't cause a trouble?
Currently it seems you are expecting only the functions which return error code.

 ret = func_return_state();
 if (ret < 0) {
    handling_error...;
 }

But how we can distinguish those?

If we have the error range for each function, we can ensure what is
*correct* error code, NULL or errno, or any other error numbers. :)

At least fail_function needs this feature because it can check
return value when setting it up.

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* Re: WARNING in strp_data_ready
From: Dmitry Vyukov @ 2017-12-28  7:54 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Ozgur, John Fastabend, syzbot, David S. Miller, Eric Biggers,
	LKML, Linux Kernel Network Developers,
	syzkaller-bugs@googlegroups.com, Tom Herbert, Cong Wang
In-Reply-To: <CALx6S35ZPyD9aS3c52Z-w4wo4=iK41iJ-+nMpQSGeQ9nMKnPbQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2349 bytes --]

On Thu, Dec 28, 2017 at 2:19 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Wed, Dec 27, 2017 at 12:20 PM, Ozgur <ozgur@goosey.org> wrote:
>>
>>
>> 27.12.2017, 23:14, "Dmitry Vyukov" <dvyukov@google.com>:
>>> On Wed, Dec 27, 2017 at 9:08 PM, Ozgur <ozgur@goosey.org> wrote:
>>>>  27.12.2017, 22:21, "Dmitry Vyukov" <dvyukov@google.com>:
>>>>>  On Wed, Dec 27, 2017 at 8:09 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>>>>   Did you try the patch I posted?
>>>>>
>>>>>  Hi Tom,
>>>>
>>>>  Hello Dmitry,
>>>>
>>>>>  No. And I didn't know I need to. Why?
>>>>>  If you think the patch needs additional testing, you can ask syzbot to
>>>>>  test it. See https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot
>>>>>  Otherwise proceed with committing it. Or what are we waiting for?
>>>>>
>>>>>  Thanks
>>>>
>>>>  I think we need to fixed patch for crash, in fact check to patch code and test solve the bug.
>>>>  How do test it because there is no patch in the following bug?
>>>
>>> Hi Ozgur,
>>>
>>> I am not sure I completely understand what you mean. But the
>>> reproducer for this bug (which one can use for testing) is here:
>>> https://groups.google.com/forum/#!topic/syzkaller-bugs/Kxs05ziCpgY
>>> Tom also mentions there is some patch for this, but I don't know where
>>> it is, it doesn't seem to be referenced from this thread.
>>
>> Hello Dmitry,
>>
>> Ah, I'm sorry I don't seen Tom mail and I don't have a patch not tested :)
>> I think Tom send patch to only you and are you tested?
>>
>> kcmsock.c will change and strp_data_ready I think locked.
>>
>> Tom, please send a patch for me? I can test and inform you.
>>
> Hi Ozgur,
>
> I reposted the patches as RFC "kcm: Fix lockdep issue". Please test if you can!

OK, I will work as your typist this time:

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
master

But I wonder what part of the following you don't understand? Do we
need to improve wording or something?

> If you think the patch needs additional testing, you can ask syzbot to test it.
> See https://github.com/google/syzkaller/blob/master/docs/syzbot.md#communication-with-syzbot

Also I don't know what git repo/branch you have in mind. Kernel
patches don't generally apply just to any tree. Fingers crossed that I
guessed correctly and it will apply.

[-- Attachment #2: strparser.patch --]
[-- Type: text/x-patch, Size: 999 bytes --]

diff --git a/include/net/sock.h b/include/net/sock.h
index 9155da422692..7a7b14e9628a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1514,6 +1514,11 @@ static inline bool sock_owned_by_user(const struct sock *sk)
 	return sk->sk_lock.owned;
 }
 
+static inline bool sock_owned_by_user_nocheck(const struct sock *sk)
+{
+	return sk->sk_lock.owned;
+}
+
 /* no reclassification while locks are held */
 static inline bool sock_allow_reclassification(const struct sock *csk)
 {
diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index c5fda15ba319..1fdab5c4eda8 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -401,7 +401,7 @@ void strp_data_ready(struct strparser *strp)
 	 * allows a thread in BH context to safely check if the process
 	 * lock is held. In this case, if the lock is held, queue work.
 	 */
-	if (sock_owned_by_user(strp->sk)) {
+	if (sock_owned_by_user_nocheck(strp->sk)) {
 		queue_work(strp_wq, &strp->work);
 		return;
 	}

^ permalink raw reply related

* Re: [RFC PATCH bpf-next v2 1/4] tracing/kprobe: bpf: Check error injectable event is on function entry
From: Masami Hiramatsu @ 2017-12-28  7:56 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Josef Bacik, rostedt, mingo, davem, netdev,
	linux-kernel, ast, kernel-team, daniel, linux-btrfs, darrick.wong,
	Josef Bacik, Akinobu Mita
In-Reply-To: <03e0ebb7-0b2a-4235-3408-c0d59a1ba4c2@fb.com>

On Wed, 27 Dec 2017 19:45:42 -0800
Alexei Starovoitov <ast@fb.com> wrote:

> On 12/27/17 6:34 PM, Masami Hiramatsu wrote:
> > On Wed, 27 Dec 2017 14:46:24 -0800
> > Alexei Starovoitov <ast@fb.com> wrote:
> >
> >> On 12/26/17 9:56 PM, Masami Hiramatsu wrote:
> >>> On Tue, 26 Dec 2017 17:57:32 -0800
> >>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>
> >>>> On Tue, Dec 26, 2017 at 04:46:59PM +0900, Masami Hiramatsu wrote:
> >>>>> Check whether error injectable event is on function entry or not.
> >>>>> Currently it checks the event is ftrace-based kprobes or not,
> >>>>> but that is wrong. It should check if the event is on the entry
> >>>>> of target function. Since error injection will override a function
> >>>>> to just return with modified return value, that operation must
> >>>>> be done before the target function starts making stackframe.
> >>>>>
> >>>>> As a side effect, bpf error injection is no need to depend on
> >>>>> function-tracer. It can work with sw-breakpoint based kprobe
> >>>>> events too.
> >>>>>
> >>>>> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> >>>>> ---
> >>>>>  kernel/trace/Kconfig        |    2 --
> >>>>>  kernel/trace/bpf_trace.c    |    6 +++---
> >>>>>  kernel/trace/trace_kprobe.c |    8 +++++---
> >>>>>  kernel/trace/trace_probe.h  |   12 ++++++------
> >>>>>  4 files changed, 14 insertions(+), 14 deletions(-)
> >>>>>
> >>>>> diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> >>>>> index ae3a2d519e50..6400e1bf97c5 100644
> >>>>> --- a/kernel/trace/Kconfig
> >>>>> +++ b/kernel/trace/Kconfig
> >>>>> @@ -533,9 +533,7 @@ config FUNCTION_PROFILER
> >>>>>  config BPF_KPROBE_OVERRIDE
> >>>>>  	bool "Enable BPF programs to override a kprobed function"
> >>>>>  	depends on BPF_EVENTS
> >>>>> -	depends on KPROBES_ON_FTRACE
> >>>>>  	depends on HAVE_KPROBE_OVERRIDE
> >>>>> -	depends on DYNAMIC_FTRACE_WITH_REGS
> >>>>>  	default n
> >>>>>  	help
> >>>>>  	 Allows BPF to override the execution of a probed function and
> >>>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> >>>>> index f6d2327ecb59..d663660f8392 100644
> >>>>> --- a/kernel/trace/bpf_trace.c
> >>>>> +++ b/kernel/trace/bpf_trace.c
> >>>>> @@ -800,11 +800,11 @@ int perf_event_attach_bpf_prog(struct perf_event *event,
> >>>>>  	int ret = -EEXIST;
> >>>>>
> >>>>>  	/*
> >>>>> -	 * Kprobe override only works for ftrace based kprobes, and only if they
> >>>>> -	 * are on the opt-in list.
> >>>>> +	 * Kprobe override only works if they are on the function entry,
> >>>>> +	 * and only if they are on the opt-in list.
> >>>>>  	 */
> >>>>>  	if (prog->kprobe_override &&
> >>>>> -	    (!trace_kprobe_ftrace(event->tp_event) ||
> >>>>> +	    (!trace_kprobe_on_func_entry(event->tp_event) ||
> >>>>>  	     !trace_kprobe_error_injectable(event->tp_event)))
> >>>>>  		return -EINVAL;
> >>>>>
> >>>>> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> >>>>> index 91f4b57dab82..265e3e27e8dc 100644
> >>>>> --- a/kernel/trace/trace_kprobe.c
> >>>>> +++ b/kernel/trace/trace_kprobe.c
> >>>>> @@ -88,13 +88,15 @@ static nokprobe_inline unsigned long trace_kprobe_nhit(struct trace_kprobe *tk)
> >>>>>  	return nhit;
> >>>>>  }
> >>>>>
> >>>>> -int trace_kprobe_ftrace(struct trace_event_call *call)
> >>>>> +bool trace_kprobe_on_func_entry(struct trace_event_call *call)
> >>>>>  {
> >>>>>  	struct trace_kprobe *tk = (struct trace_kprobe *)call->data;
> >>>>> -	return kprobe_ftrace(&tk->rp.kp);
> >>>>> +
> >>>>> +	return kprobe_on_func_entry(tk->rp.kp.addr, tk->rp.kp.symbol_name,
> >>>>> +				    tk->rp.kp.offset);
> >>>>
> >>>> That would be nice, but did you test this?
> >>>
> >>> Yes, because the jprobe, which was only official user of modifying execution
> >>> path using kprobe, did same way to check. (and kretprobe also does it)
> >>>
> >>>> My understanding that kprobe will restore all regs and
> >>>> here we need to override return ip _and_ value.
> >>>
> >>> yes, no problem. kprobe restore all regs from pt_regs, including regs->ip.
> >>>
> >>>> Could you add a patch with the test the way Josef did
> >>>> or describe the steps to test this new mode?
> >>>
> >>> Would you mean below patch? If so, it should work without any change.
> >>>
> >>>  [PATCH v10 4/5] samples/bpf: add a test for bpf_override_return
> >>
> >> yeah. I expect bpf_override_return test to work as-is.
> >> I'm asking for the test for new functionality added by this patch.
> >> In particular kprobe on func entry without ftrace.
> >> How did you test it?
> >
> > This function is used in kretprobe and jprobe. Jprobe was the user of
> > "modifying instruction pointer to another function" in kprobes.
> > If it doesn't work, jprobe also doesn't work, this means you can not
> > modify IP by kprobes anymore.
> > Anyway, until linux-4.13, that was well tested by kprobe smoke test.
> >
> >> and how I can repeat the test?
> >> I'm still not sure that it works correctly.
> >
> > That works correctly because it checks given address is on the entry
> > point (the 1st instruction) of a function, using kallsyms.
> >
> > The reason why I made another flag for ftrace was, there are 2 modes
> > for ftrace dynamic instrumentation, fentry and mcount.
> > With new fentry mode, ftrace will be put on the first instruction
> > of the function, so it will work as you expected.
> > With traditional gcc mcount, ftrace will be called after making call
> > frame for _mcount(). This means if you modify ip, it will not work
> > or cause a trouble because _mcount call frame is still on the stack.
> >
> > So, current ftrace-based checker doesn't work, it depends on the case.
> > Of course, in most case, kernel will be build in new gcc which
> > supports fentry, but there is no guarantee.
> 
> I don't think that's the case. My reading of current
> trace_kprobe_ftrace() -> arch_check_ftrace_location()
> is that it will not be true for old mcount case.
> 
> As far as the rest of your arguments it very much puzzles me that
> you claim that this patch suppose to work based on historical
> reasoning whereas you did NOT test it.

No, even with this patch, I meant, you can still test it with Josef's
testcase, since this is just backend change. No frontend changes.

Thank you,

-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply

* [PATCH bpf-next v3 0/3] libbpf: add XDP setup support
From: Eric Leblond @ 2017-12-28  8:04 UTC (permalink / raw)
  To: daniel; +Cc: Alexei Starovoitov, netdev, linux-kernel
In-Reply-To: <b2cf1e8d-fc1d-4ffb-efec-fafe641c3a49@iogearbox.net>

Hello,

This patchset adds support for loading eBPF code as XDP filter.

Best regards,
--
Eric Leblond

^ permalink raw reply

* [PATCH bpf-next v3 1/3] libbpf: add function to setup XDP
From: Eric Leblond @ 2017-12-28  8:04 UTC (permalink / raw)
  To: daniel; +Cc: Alexei Starovoitov, netdev, linux-kernel, Eric Leblond
In-Reply-To: <20171228080437.16933-1-eric@regit.org>

Most of the code is taken from set_link_xdp_fd() in bpf_load.c and
slightly modified to be library compliant.

Signed-off-by: Eric Leblond <eric@regit.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/lib/bpf/bpf.c    | 126 ++++++++++++++++++++++++++++++++++++++++++++++++-
 tools/lib/bpf/libbpf.c |   2 +
 tools/lib/bpf/libbpf.h |   4 ++
 3 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 5128677e4117..1e3cfe6b9fce 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -25,6 +25,16 @@
 #include <asm/unistd.h>
 #include <linux/bpf.h>
 #include "bpf.h"
+#include "libbpf.h"
+#include <linux/rtnetlink.h>
+#include <sys/socket.h>
+#include <errno.h>
+
+#ifndef IFLA_XDP_MAX
+#define IFLA_XDP	43
+#define IFLA_XDP_FD	1
+#define IFLA_XDP_FLAGS	3
+#endif
 
 /*
  * When building perf, unistd.h is overridden. __NR_bpf is
@@ -46,8 +56,6 @@
 # endif
 #endif
 
-#define min(x, y) ((x) < (y) ? (x) : (y))
-
 static inline __u64 ptr_to_u64(const void *ptr)
 {
 	return (__u64) (unsigned long) ptr;
@@ -413,3 +421,117 @@ int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len)
 
 	return err;
 }
+
+int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
+{
+	struct sockaddr_nl sa;
+	int sock, seq = 0, len, ret = -1;
+	char buf[4096];
+	struct nlattr *nla, *nla_xdp;
+	struct {
+		struct nlmsghdr  nh;
+		struct ifinfomsg ifinfo;
+		char             attrbuf[64];
+	} req;
+	struct nlmsghdr *nh;
+	struct nlmsgerr *err;
+	socklen_t addrlen;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.nl_family = AF_NETLINK;
+
+	sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+	if (sock < 0) {
+		return -errno;
+	}
+
+	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
+		ret = -errno;
+		goto cleanup;
+	}
+
+	addrlen = sizeof(sa);
+	if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0) {
+		ret = errno;
+		goto cleanup;
+	}
+
+	if (addrlen != sizeof(sa)) {
+		ret = errno;
+		goto cleanup;
+	}
+
+	memset(&req, 0, sizeof(req));
+	req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	req.nh.nlmsg_type = RTM_SETLINK;
+	req.nh.nlmsg_pid = 0;
+	req.nh.nlmsg_seq = ++seq;
+	req.ifinfo.ifi_family = AF_UNSPEC;
+	req.ifinfo.ifi_index = ifindex;
+
+	/* started nested attribute for XDP */
+	nla = (struct nlattr *)(((char *)&req)
+				+ NLMSG_ALIGN(req.nh.nlmsg_len));
+	nla->nla_type = NLA_F_NESTED | IFLA_XDP;
+	nla->nla_len = NLA_HDRLEN;
+
+	/* add XDP fd */
+	nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+	nla_xdp->nla_type = IFLA_XDP_FD;
+	nla_xdp->nla_len = NLA_HDRLEN + sizeof(int);
+	memcpy((char *)nla_xdp + NLA_HDRLEN, &fd, sizeof(fd));
+	nla->nla_len += nla_xdp->nla_len;
+
+	/* if user passed in any flags, add those too */
+	if (flags) {
+		nla_xdp = (struct nlattr *)((char *)nla + nla->nla_len);
+		nla_xdp->nla_type = IFLA_XDP_FLAGS;
+		nla_xdp->nla_len = NLA_HDRLEN + sizeof(flags);
+		memcpy((char *)nla_xdp + NLA_HDRLEN, &flags, sizeof(flags));
+		nla->nla_len += nla_xdp->nla_len;
+	}
+
+	req.nh.nlmsg_len += NLA_ALIGN(nla->nla_len);
+
+	if (send(sock, &req, req.nh.nlmsg_len, 0) < 0) {
+		ret = -errno;
+		goto cleanup;
+	}
+
+	len = recv(sock, buf, sizeof(buf), 0);
+	if (len < 0) {
+		ret = -errno;
+		goto cleanup;
+	}
+
+	for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+	     nh = NLMSG_NEXT(nh, len)) {
+		if (nh->nlmsg_pid != sa.nl_pid) {
+			ret = -LIBBPF_ERRNO__WRNGPID;
+			goto cleanup;
+		}
+		if (nh->nlmsg_seq != seq) {
+			ret = -LIBBPF_ERRNO__INVSEQ;
+			goto cleanup;
+		}
+		switch (nh->nlmsg_type) {
+		case NLMSG_ERROR:
+			err = (struct nlmsgerr *)NLMSG_DATA(nh);
+			if (!err->error)
+				continue;
+			ret = err->error;
+			goto cleanup;
+		case NLMSG_DONE:
+			break;
+		default:
+			break;
+		}
+	}
+
+	ret = 0;
+
+cleanup:
+	close(sock);
+	return ret;
+}
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e9c4b7cabcf2..5fe8aaa2123e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -106,6 +106,8 @@ static const char *libbpf_strerror_table[NR_ERRNO] = {
 	[ERRCODE_OFFSET(PROG2BIG)]	= "Program too big",
 	[ERRCODE_OFFSET(KVER)]		= "Incorrect kernel version",
 	[ERRCODE_OFFSET(PROGTYPE)]	= "Kernel doesn't support this program type",
+	[ERRCODE_OFFSET(WRNGPID)]	= "Wrong pid in netlink message",
+	[ERRCODE_OFFSET(INVSEQ)]	= "Invalid netlink sequence",
 };
 
 int libbpf_strerror(int err, char *buf, size_t size)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 6e20003109e0..e42f96900318 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -42,6 +42,8 @@ enum libbpf_errno {
 	LIBBPF_ERRNO__PROG2BIG,	/* Program too big */
 	LIBBPF_ERRNO__KVER,	/* Incorrect kernel version */
 	LIBBPF_ERRNO__PROGTYPE,	/* Kernel doesn't support this program type */
+	LIBBPF_ERRNO__WRNGPID,	/* Wrong pid in netlink message */
+	LIBBPF_ERRNO__INVSEQ,	/* Invalid netlink sequence */
 	__LIBBPF_ERRNO__END,
 };
 
@@ -246,4 +248,6 @@ long libbpf_get_error(const void *ptr);
 
 int bpf_prog_load(const char *file, enum bpf_prog_type type,
 		  struct bpf_object **pobj, int *prog_fd);
+
+int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
 #endif
-- 
2.15.1

^ permalink raw reply related

* [PATCH bpf-next v3 2/3] libbpf: add error reporting in XDP
From: Eric Leblond @ 2017-12-28  8:04 UTC (permalink / raw)
  To: daniel; +Cc: Alexei Starovoitov, netdev, linux-kernel, Eric Leblond
In-Reply-To: <20171228080437.16933-1-eric@regit.org>

Parse netlink ext attribute to get the error message returned by
the card. Code is partially take from libnl.

Signed-off-by: Eric Leblond <eric@regit.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
---
 tools/lib/bpf/Build    |   2 +-
 tools/lib/bpf/bpf.c    |   9 +++
 tools/lib/bpf/nlattr.c | 187 +++++++++++++++++++++++++++++++++++++++++++++++++
 tools/lib/bpf/nlattr.h |  69 ++++++++++++++++++
 4 files changed, 266 insertions(+), 1 deletion(-)
 create mode 100644 tools/lib/bpf/nlattr.c
 create mode 100644 tools/lib/bpf/nlattr.h

diff --git a/tools/lib/bpf/Build b/tools/lib/bpf/Build
index d8749756352d..64c679d67109 100644
--- a/tools/lib/bpf/Build
+++ b/tools/lib/bpf/Build
@@ -1 +1 @@
-libbpf-y := libbpf.o bpf.o
+libbpf-y := libbpf.o bpf.o nlattr.o
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 1e3cfe6b9fce..cdfabbe118cc 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -26,6 +26,7 @@
 #include <linux/bpf.h>
 #include "bpf.h"
 #include "libbpf.h"
+#include "nlattr.h"
 #include <linux/rtnetlink.h>
 #include <sys/socket.h>
 #include <errno.h>
@@ -436,6 +437,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
 	struct nlmsghdr *nh;
 	struct nlmsgerr *err;
 	socklen_t addrlen;
+	int one;
 
 	memset(&sa, 0, sizeof(sa));
 	sa.nl_family = AF_NETLINK;
@@ -445,6 +447,12 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
 		return -errno;
 	}
 
+	if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK,
+		       &one, sizeof(one)) < 0) {
+		/* debug/verbose message that it is not supported */
+		fprintf(stderr, "Netlink error reporting not supported\n");
+	}
+
 	if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
 		ret = -errno;
 		goto cleanup;
@@ -521,6 +529,7 @@ int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
 			if (!err->error)
 				continue;
 			ret = err->error;
+			nla_dump_errormsg(nh);
 			goto cleanup;
 		case NLMSG_DONE:
 			break;
diff --git a/tools/lib/bpf/nlattr.c b/tools/lib/bpf/nlattr.c
new file mode 100644
index 000000000000..5cc74fa98049
--- /dev/null
+++ b/tools/lib/bpf/nlattr.c
@@ -0,0 +1,187 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
+/*
+ * NETLINK      Netlink attributes
+ *
+ *	This library is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU Lesser General Public
+ *	License as published by the Free Software Foundation version 2.1
+ *	of the License.
+ *
+ * Copyright (c) 2003-2013 Thomas Graf <tgraf@suug.ch>
+ */
+
+#include <errno.h>
+#include "nlattr.h"
+#include <linux/rtnetlink.h>
+#include <string.h>
+#include <stdio.h>
+
+static uint16_t nla_attr_minlen[NLA_TYPE_MAX+1] = {
+	[NLA_U8]	= sizeof(uint8_t),
+	[NLA_U16]	= sizeof(uint16_t),
+	[NLA_U32]	= sizeof(uint32_t),
+	[NLA_U64]	= sizeof(uint64_t),
+	[NLA_STRING]	= 1,
+	[NLA_FLAG]	= 0,
+};
+
+static int nla_len(const struct nlattr *nla)
+{
+	return nla->nla_len - NLA_HDRLEN;
+}
+
+static struct nlattr *nla_next(const struct nlattr *nla, int *remaining)
+{
+	int totlen = NLA_ALIGN(nla->nla_len);
+
+	*remaining -= totlen;
+	return (struct nlattr *) ((char *) nla + totlen);
+}
+
+static int nla_ok(const struct nlattr *nla, int remaining)
+{
+	return remaining >= sizeof(*nla) &&
+	       nla->nla_len >= sizeof(*nla) &&
+	       nla->nla_len <= remaining;
+}
+
+static void *nla_data(const struct nlattr *nla)
+{
+	return (char *) nla + NLA_HDRLEN;
+}
+
+static int nla_type(const struct nlattr *nla)
+{
+	return nla->nla_type & NLA_TYPE_MASK;
+}
+
+static int validate_nla(struct nlattr *nla, int maxtype,
+			struct nla_policy *policy)
+{
+	struct nla_policy *pt;
+	unsigned int minlen = 0;
+	int type = nla_type(nla);
+
+	if (type < 0 || type > maxtype)
+		return 0;
+
+	pt = &policy[type];
+
+	if (pt->type > NLA_TYPE_MAX)
+		return 0;
+
+	if (pt->minlen)
+		minlen = pt->minlen;
+	else if (pt->type != NLA_UNSPEC)
+		minlen = nla_attr_minlen[pt->type];
+
+	if (nla_len(nla) < minlen)
+		return -1;
+
+	if (pt->maxlen && nla_len(nla) > pt->maxlen)
+		return -1;
+
+	if (pt->type == NLA_STRING) {
+		char *data = nla_data(nla);
+		if (data[nla_len(nla) - 1] != '\0')
+			return -1;
+	}
+
+	return 0;
+}
+
+static inline int nlmsg_len(const struct nlmsghdr *nlh)
+{
+	return nlh->nlmsg_len - NLMSG_HDRLEN;
+}
+
+/**
+ * Create attribute index based on a stream of attributes.
+ * @arg tb		Index array to be filled (maxtype+1 elements).
+ * @arg maxtype		Maximum attribute type expected and accepted.
+ * @arg head		Head of attribute stream.
+ * @arg len		Length of attribute stream.
+ * @arg policy		Attribute validation policy.
+ *
+ * Iterates over the stream of attributes and stores a pointer to each
+ * attribute in the index array using the attribute type as index to
+ * the array. Attribute with a type greater than the maximum type
+ * specified will be silently ignored in order to maintain backwards
+ * compatibility. If \a policy is not NULL, the attribute will be
+ * validated using the specified policy.
+ *
+ * @see nla_validate
+ * @return 0 on success or a negative error code.
+ */
+static int nla_parse(struct nlattr *tb[], int maxtype, struct nlattr *head, int len,
+		     struct nla_policy *policy)
+{
+	struct nlattr *nla;
+	int rem, err;
+
+	memset(tb, 0, sizeof(struct nlattr *) * (maxtype + 1));
+
+	nla_for_each_attr(nla, head, len, rem) {
+		int type = nla_type(nla);
+
+		if (type > maxtype)
+			continue;
+
+		if (policy) {
+			err = validate_nla(nla, maxtype, policy);
+			if (err < 0)
+				goto errout;
+		}
+
+		if (tb[type])
+			fprintf(stderr, "Attribute of type %#x found multiple times in message, "
+				  "previous attribute is being ignored.\n", type);
+
+		tb[type] = nla;
+	}
+
+	err = 0;
+errout:
+	return err;
+}
+
+/* dump netlink extended ack error message */
+int nla_dump_errormsg(struct nlmsghdr *nlh)
+{
+	struct nla_policy extack_policy[NLMSGERR_ATTR_MAX + 1] = {
+		[NLMSGERR_ATTR_MSG]	= { .type = NLA_STRING },
+		[NLMSGERR_ATTR_OFFS]	= { .type = NLA_U32 },
+	};
+	struct nlattr *tb[NLMSGERR_ATTR_MAX + 1], *attr;
+	struct nlmsgerr *err;
+	char *errmsg = NULL;
+	int hlen, alen;
+
+	/* no TLVs, nothing to do here */
+	if (!(nlh->nlmsg_flags & NLM_F_ACK_TLVS))
+		return 0;
+
+	err = (struct nlmsgerr *)NLMSG_DATA(nlh);
+	hlen = sizeof(*err);
+
+	/* if NLM_F_CAPPED is set then the inner err msg was capped */
+	if (!(nlh->nlmsg_flags & NLM_F_CAPPED))
+		hlen += nlmsg_len(&err->msg);
+
+	attr = (struct nlattr *) ((void *) err + hlen);
+	alen = nlh->nlmsg_len - hlen;
+
+	if (nla_parse(tb, NLMSGERR_ATTR_MAX, attr, alen, extack_policy) != 0) {
+		fprintf(stderr,
+			"Failed to parse extended error attributes\n");
+		return 0;
+	}
+
+	if (tb[NLMSGERR_ATTR_MSG])
+		errmsg = (char *) nla_data(tb[NLMSGERR_ATTR_MSG]);
+
+	fprintf(stderr, "Kernel error message: %s\n", errmsg);
+
+	return 0;
+}
diff --git a/tools/lib/bpf/nlattr.h b/tools/lib/bpf/nlattr.h
new file mode 100644
index 000000000000..fa2d015334ef
--- /dev/null
+++ b/tools/lib/bpf/nlattr.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: LGPL-2.1 */
+
+/*
+ * NETLINK      Netlink attributes
+ *
+ *	This library is free software; you can redistribute it and/or
+ *	modify it under the terms of the GNU Lesser General Public
+ *	License as published by the Free Software Foundation version 2.1
+ *	of the License.
+ *
+ * Copyright (c) 2003-2013 Thomas Graf <tgraf@suug.ch>
+ */
+
+#ifndef __NLATTR_H
+#define __NLATTR_H
+
+#include <linux/netlink.h>
+
+/**
+ * Standard attribute types to specify validation policy
+ */
+enum {
+	NLA_UNSPEC,	/**< Unspecified type, binary data chunk */
+	NLA_U8,		/**< 8 bit integer */
+	NLA_U16,	/**< 16 bit integer */
+	NLA_U32,	/**< 32 bit integer */
+	NLA_U64,	/**< 64 bit integer */
+	NLA_STRING,	/**< NUL terminated character string */
+	NLA_FLAG,	/**< Flag */
+	NLA_MSECS,	/**< Micro seconds (64bit) */
+	NLA_NESTED,	/**< Nested attributes */
+	__NLA_TYPE_MAX,
+};
+
+#define NLA_TYPE_MAX (__NLA_TYPE_MAX - 1)
+
+/**
+ * @ingroup attr
+ * Attribute validation policy.
+ *
+ * See section @core_doc{core_attr_parse,Attribute Parsing} for more details.
+ */
+struct nla_policy {
+	/** Type of attribute or NLA_UNSPEC */
+	uint16_t	type;
+
+	/** Minimal length of payload required */
+	uint16_t	minlen;
+
+	/** Maximal length of payload allowed */
+	uint16_t	maxlen;
+};
+
+/**
+ * @ingroup attr
+ * Iterate over a stream of attributes
+ * @arg pos	loop counter, set to current attribute
+ * @arg head	head of attribute stream
+ * @arg len	length of attribute stream
+ * @arg rem	initialized to len, holds bytes currently remaining in stream
+ */
+#define nla_for_each_attr(pos, head, len, rem) \
+	for (pos = head, rem = len; \
+	     nla_ok(pos, rem); \
+	     pos = nla_next(pos, &(rem)))
+
+int nla_dump_errormsg(struct nlmsghdr *nlh);
+
+#endif /* __NLATTR_H */
-- 
2.15.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox