Netdev List


* [PATCH iproute2-next v2 1/3] ll_map: Add function to remove link cache entry by index
From: David Ahern @ 2019-02-14  0:22 UTC (permalink / raw)
  To: stephen; +Cc: netdev, David Ahern
In-Reply-To: <20190214002249.31866-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Add ll_drop_by_index to remove an entry from the link cache.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/ll_map.h |  1 +
 lib/ll_map.c     | 14 ++++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/include/ll_map.h b/include/ll_map.h
index 511fe00b8567..4de1041e2746 100644
--- a/include/ll_map.h
+++ b/include/ll_map.h
@@ -9,6 +9,7 @@ unsigned ll_name_to_index(const char *name);
 const char *ll_index_to_name(unsigned idx);
 int ll_index_to_type(unsigned idx);
 int ll_index_to_flags(unsigned idx);
+void ll_drop_by_index(unsigned index);
 unsigned namehash(const char *str);
 
 const char *ll_idx_n2a(unsigned int idx);
diff --git a/lib/ll_map.c b/lib/ll_map.c
index 1ab8ef0758ac..8e8a0b1e9c9d 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -210,6 +210,20 @@ unsigned ll_name_to_index(const char *name)
 	return idx;
 }
 
+void ll_drop_by_index(unsigned index)
+{
+	struct ll_cache *im;
+
+	im = ll_get_by_index(index);
+	if (!im)
+		return;
+
+	hlist_del(&im->idx_hash);
+	hlist_del(&im->name_hash);
+
+	free(im);
+}
+
 void ll_init_map(struct rtnl_handle *rth)
 {
 	static int initialized;
-- 
2.11.0


^ permalink raw reply related

* [PATCH iproute2-next v2 3/3] Improve batch and dump times by caching link lookups
From: David Ahern @ 2019-02-14  0:22 UTC (permalink / raw)
  To: stephen; +Cc: netdev, David Ahern
In-Reply-To: <20190214002249.31866-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

ip route uses ll_name_to_index and ll_index_to_name to convert between
device names and indices. At the moment both use the ioctl based glibc
functions if_nametoindex and if_indextoname and do not cache the result.
When using a batch file or dumping a large number of routes this means the
same device lookups can be done repeatedly, adding unnecessary overhead
(socket + ioctl + close for each device lookup).

Add a new function, ll_link_get, to send a netlink based RTM_GETLINK. If
successful, cache the result in idx_head and name_head so future lookups
can re-use the entry. Update ll_name_to_index and ll_index_to_name to use
ll_link_get and only fall back to the glibc functions if it fails.

With this change the time to install 720,022 routes with 2 ecmp nexthops
where the nexthop device is given is reduced from 31.4 seconds to 19.2
seconds. A dump of those routes drops from 13.3 to 2.8 seconds.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 lib/ll_map.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 51 insertions(+), 1 deletion(-)

diff --git a/lib/ll_map.c b/lib/ll_map.c
index 8e8a0b1e9c9d..2d7b65dcb8f7 100644
--- a/lib/ll_map.c
+++ b/lib/ll_map.c
@@ -152,6 +152,48 @@ static unsigned int ll_idx_a2n(const char *name)
 	return idx;
 }
 
+static int ll_link_get(const char *name, int index)
+{
+	struct {
+		struct nlmsghdr		n;
+		struct ifinfomsg	ifm;
+		char			buf[1024];
+	} req = {
+		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+		.n.nlmsg_flags = NLM_F_REQUEST,
+		.n.nlmsg_type = RTM_GETLINK,
+		.ifm.ifi_index = index,
+	};
+	__u32 filt_mask = RTEXT_FILTER_VF | RTEXT_FILTER_SKIP_STATS;
+	struct rtnl_handle rth = {};
+	struct nlmsghdr *answer;
+	int rc = 0;
+
+	if (rtnl_open(&rth, 0) < 0)
+		return 0;
+
+	addattr32(&req.n, sizeof(req), IFLA_EXT_MASK, filt_mask);
+	if (name)
+		addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name,
+			  strlen(name) + 1);
+
+	if (rtnl_talk(&rth, &req.n, &answer) < 0)
+		goto out;
+
+	/* add entry to cache */
+	rc  = ll_remember_index(answer, NULL);
+	if (!rc) {
+		struct ifinfomsg *ifm = NLMSG_DATA(answer);
+
+		rc = ifm->ifi_index;
+	}
+
+	free(answer);
+out:
+	rtnl_close(&rth);
+	return rc;
+}
+
 const char *ll_index_to_name(unsigned int idx)
 {
 	static char buf[IFNAMSIZ];
@@ -164,6 +206,12 @@ const char *ll_index_to_name(unsigned int idx)
 	if (im)
 		return im->name;
 
+	if (ll_link_get(NULL, idx) == idx) {
+		im = ll_get_by_index(idx);
+		if (im)
+			return im->name;
+	}
+
 	if (if_indextoname(idx, buf) == NULL)
 		snprintf(buf, IFNAMSIZ, "if%u", idx);
 
@@ -204,7 +252,9 @@ unsigned ll_name_to_index(const char *name)
 	if (im)
 		return im->index;
 
-	idx = if_nametoindex(name);
+	idx = ll_link_get(name, 0);
+	if (idx == 0)
+		idx = if_nametoindex(name);
 	if (idx == 0)
 		idx = ll_idx_a2n(name);
 	return idx;
-- 
2.11.0


^ permalink raw reply related

* [PATCH iproute2-next v2 0/3] Improve batch and dump times by caching link lookups
From: David Ahern @ 2019-02-14  0:22 UTC (permalink / raw)
  To: stephen; +Cc: netdev, David Ahern

From: David Ahern <dsahern@gmail.com>

Many commands convert device names to an index using ll_name_to_index and
the reverse from an index to a name using ll_index_to_name.

At the moment both of the ll_ functions use the ioctl based helpers from
glibc, which involves opening a socket, calling ioctl and then closing the
socket on each device lookup. When using a batch file or dumping a large
number of routes this means the same device lookups can be done repeatedly,
adding unnecessary overhead to both operations.

This series adds a new function, ll_link_get, to send a netlink based
RTM_GETLINK. If successful, the result is cached in idx_head and name_head
so future lookups can re-use the entry. iproute2's ll_map functions are
updated to use ll_link_get over the glibc functions. The result is a
significant speed up in both batch and dumps with negligible overhead if
ip is invoked for single operations.

The first 2 patches add a means to drop an entry from the cache and update
iplink_modify to use that new function to drop entries on device changes.
This forces the cache to re-learn device information if a batch file has a
mix of link set operations with other commands - such as adding a route.
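
As a rough illustration of the lookup order described above (cache first,
then the netlink request, then the glibc fallback, with entries dropped on
link changes), here is a minimal standalone sketch. It is not the code from
this series: the fixed-size array cache, the resolve_fn type, name_to_index
and the demo_* resolvers are hypothetical stand-ins for the ll_map hash
tables, ll_link_get and if_nametoindex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN    16	/* matches IFNAMSIZ on Linux */
#define CACHE_SLOTS 64	/* hypothetical fixed-size cache */

/* Hypothetical resolver type: returns an ifindex, or 0 on failure */
typedef unsigned int (*resolve_fn)(const char *name);

struct link_entry {
	char name[NAME_LEN];
	unsigned int index;	/* 0 marks an unused slot */
};

static struct link_entry cache[CACHE_SLOTS];
static unsigned int demo_calls;	/* counts slow-path lookups */

static struct link_entry *cache_find(const char *name)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].index && strcmp(cache[i].name, name) == 0)
			return &cache[i];
	return NULL;
}

static void cache_add(const char *name, unsigned int index)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		if (cache[i].index == 0) {
			snprintf(cache[i].name, sizeof(cache[i].name),
				 "%s", name);
			cache[i].index = index;
			return;
		}
	}
	/* cache full: the entry is simply not remembered */
}

/* Drop an entry so the next lookup re-learns it */
static void cache_drop(unsigned int index)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].index == index)
			cache[i].index = 0;
}

/* Stand-in for the netlink request: knows two made-up devices */
static unsigned int demo_resolver(const char *name)
{
	demo_calls++;
	if (strcmp(name, "lo") == 0)
		return 1;
	if (strcmp(name, "eth0") == 0)
		return 2;
	return 0;
}

/* Stand-in for the glibc fallback: always fails in this sketch */
static unsigned int demo_fallback(const char *name)
{
	(void)name;
	return 0;
}

/* Try the cache, then the primary resolver, then the fallback */
static unsigned int name_to_index(const char *name, resolve_fn primary,
				  resolve_fn fallback)
{
	struct link_entry *e;
	unsigned int idx;

	assert(name != NULL);

	e = cache_find(name);
	if (e)
		return e->index;

	idx = primary(name);
	if (idx) {
		cache_add(name, idx);
		return idx;
	}
	return fallback(name);
}
```

Calling name_to_index("eth0", demo_resolver, demo_fallback) twice takes the
slow path only once; after cache_drop(2) the next call resolves again, which
is the behaviour the first two patches are after for link changes.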

v2
- changed the second patch to drop cache entry on any link changes
- added ll_link_get to index to name conversion improving dumps

David Ahern (3):
  ll_map: Add function to remove link cache entry by index
  ip link: Drop cache entry on any changes
  Improve batch and dump times by caching link lookups

 include/ll_map.h |  1 +
 ip/iplink.c      |  3 +++
 lib/ll_map.c     | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 69 insertions(+), 1 deletion(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH iproute2-next v2 2/3] ip link: Drop cache entry on any changes
From: David Ahern @ 2019-02-14  0:22 UTC (permalink / raw)
  To: stephen; +Cc: netdev, David Ahern
In-Reply-To: <20190214002249.31866-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Remove any entry from the link cache when the link is modified.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 ip/iplink.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/ip/iplink.c b/ip/iplink.c
index b5519201fef7..393cefdc89df 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -1083,6 +1083,9 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -2;
 
+	/* remove device from cache; next use can refresh with new data */
+	ll_drop_by_index(req.i.ifi_index);
+
 	return 0;
 }
 
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH net] sctp: call gso_reset_checksum when computing checksum in sctp_gso_segment
From: David Miller @ 2019-02-14  0:33 UTC (permalink / raw)
  To: lucien.xin; +Cc: linux-kernel, netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <5b8187d1eabd52e4db7d3e4506d98c33571c1c83.1549968450.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Tue, 12 Feb 2019 18:47:30 +0800

> Jianlin reported a panic when running sctp gso over gre over vlan device:
 ...
> It was caused by SKB_GSO_CB(skb)->csum_start not set in sctp_gso_segment.
> sctp_gso_segment() calls skb_segment() with 'feature | NETIF_F_HW_CSUM',
> which causes SKB_GSO_CB(skb)->csum_start not to be set in skb_segment().
> 
> For TCP/UDP, when feature supports HW_CSUM, CHECKSUM_PARTIAL will be set
> and gso_reset_checksum will be called to set SKB_GSO_CB(skb)->csum_start.
> 
> So SCTP should do the same as TCP/UDP, to call gso_reset_checksum() when
> computing checksum in sctp_gso_segment.
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net] sctp: set stream ext to NULL after freeing it in sctp_stream_outq_migrate
From: David Miller @ 2019-02-14  0:34 UTC (permalink / raw)
  To: lucien.xin; +Cc: linux-kernel, netdev, linux-sctp, marcelo.leitner, nhorman
In-Reply-To: <0cb9e543c21495df48c3723044d6c9f64f238eca.1549968661.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Tue, 12 Feb 2019 18:51:01 +0800

> In sctp_stream_init(), after sctp_stream_outq_migrate() freed the
> surplus streams' ext, but sctp_stream_alloc_out() returns -ENOMEM,
> stream->outcnt will not be set to 'outcnt'.
> 
> With the bigger value on stream->outcnt, when closing the assoc and
> freeing its streams, the ext of those surplus streams will be freed
> again since those stream exts were not set to NULL after freeing in
> sctp_stream_outq_migrate(). Then the invalid-free issue reported by
> syzbot would be triggered.
> 
> We fix it by simply setting them to NULL after freeing.
> 
> Fixes: 5bbbbe32a431 ("sctp: introduce stream scheduler foundations")
> Reported-by: syzbot+58e480e7b28f2d890bfd@syzkaller.appspotmail.com
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Also applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH bpf-next 4/4] selftests/bpf: Test static data relocation
From: Alexei Starovoitov @ 2019-02-14  0:35 UTC (permalink / raw)
  To: Joe Stringer; +Cc: bpf, netdev, Daniel Borkmann, ast
In-Reply-To: <CAOftzPjffHZUK=uxiTsuOGLX4A5ag5oqRz6gorhREj6hxMFA4A@mail.gmail.com>

On Tue, Feb 12, 2019 at 12:43:21PM -0800, Joe Stringer wrote:
> 
> Do you see any value in having incremental support in libbpf that
> could be used as a fallback for older kernels like in patch #2 of this
> series? I could imagine libbpf probing kernel support for
> global/static variables and attempting to handle references to .data
> via some more comprehensive mechanism in-kernel, or falling back to
> this approach if it is not available.

I don't think we have to view it as older vs new kernel and fallback discussion.
I think access to static vars can be implemented in libbpf today without
changing llvm and kernel.

For the following code:
static volatile __u32 static_data = 42;

SEC("anything")
int load_static_data(struct __sk_buff *skb)
{
   __u32 value = static_data;

llvm will generate asm:

  r1 = static_data ll
  r1 = *(u32 *)(r1 + 0)

libbpf can replace first insn with r1 = 0 (or remove it altogether)
and second insn with r1 = 42 _when it is safe_.

If there was no volatile keyword llvm would have optimized
these two instructions into an operation with an immediate constant.
libbpf can do this optimization instead of llvm.
libbpf can check that 'static_data' is indeed not global in elf file
and there are no store operations in all programs in that elf file.
Then every load from that address can be replaced with rX=imm
of the value from data section.
libbpf would need to do data flow analysis which is a substantial
feature addition. I think it's an inevitable next step anyway.

The key point is that this approach will be compatible with future
global variables and modifiable static variables.
In such a case load/store instructions will stay as-is
and kernel support will be needed to replace 'r1 = static_data ll'
with a properly marked ld_imm64 insn.
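
To make the rewrite concrete, here is a toy sketch of the idea. Everything
in it is an assumption for illustration: struct toy_insn and the TOY_* ops
are a made-up instruction model, not the real struct bpf_insn encoding, and
fold_static_loads is a hypothetical name, not a libbpf function. It only
handles the adjacent two-instruction pattern shown above, where the load
overwrites the register holding the address; anything more general needs
the data flow analysis mentioned earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy instruction model, NOT the real BPF encoding */
enum toy_op {
	TOY_LD_ADDR,	/* dst = address of .data + imm */
	TOY_LDX_U32,	/* dst = *(u32 *)(src + off) */
	TOY_STX_U32,	/* *(u32 *)(dst + off) = src */
	TOY_MOV_IMM,	/* dst = imm */
};

struct toy_insn {
	enum toy_op op;
	uint8_t dst;
	uint8_t src;
	int16_t off;
	int32_t imm;
};

/*
 * Replace "rX = &data[imm]; rX = *(u32 *)(rX + off)" with
 * "rX = 0; rX = <value from data section>".
 * Returns the number of loads folded, or 0 if the program contains
 * any store, in which case nothing is modified.
 */
static size_t fold_static_loads(struct toy_insn *prog, size_t n,
				const uint32_t *data, size_t data_words)
{
	size_t folded = 0;

	assert(prog != NULL && data != NULL);

	/* Conservative safety check: no stores anywhere in the program */
	for (size_t i = 0; i < n; i++)
		if (prog[i].op == TOY_STX_U32)
			return 0;

	for (size_t i = 0; i + 1 < n; i++) {
		struct toy_insn *ld = &prog[i];
		struct toy_insn *ldx = &prog[i + 1];
		int64_t addr;

		if (ld->op != TOY_LD_ADDR || ldx->op != TOY_LDX_U32)
			continue;
		/* the load must read via, and clobber, the address register */
		if (ldx->src != ld->dst || ldx->dst != ld->dst)
			continue;

		addr = (int64_t)ld->imm + ldx->off;
		if (addr < 0 || addr % 4 != 0 ||
		    (uint64_t)addr / 4 >= data_words)
			continue;

		ldx->op = TOY_MOV_IMM;
		ldx->imm = (int32_t)data[addr / 4];
		ldx->src = 0;
		ldx->off = 0;

		ld->op = TOY_MOV_IMM;
		ld->imm = 0;
		folded++;
	}
	return folded;
}
```

With a data section holding the single word 42 and the two-instruction
program from the example, the sketch turns the pair into r1 = 0 followed by
r1 = 42. If any store is present it leaves the program untouched, matching
the "when it is safe" condition.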

^ permalink raw reply

* Re: [PATCH bpf-next v11 0/7] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap
From: David Ahern @ 2019-02-14  0:46 UTC (permalink / raw)
  To: Peter Oskolkov, Alexei Starovoitov, Daniel Borkmann, netdev
  Cc: Peter Oskolkov, Willem de Bruijn
In-Reply-To: <20190213195341.184969-1-posk@google.com>

On 2/13/19 12:53 PM, Peter Oskolkov wrote:
> This patchset implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
> BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
> and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
> to packets (e.g. IP/GRE, GUE, IPIP).
> 
> This is useful when thousands of different short-lived flows should be
> encapped, each with different and dynamically determined destination.
> Although lwtunnels can be used in some of these scenarios, the ability
> to dynamically generate encap headers adds more flexibility, e.g.
> when routing depends on the state of the host (reflected in global bpf
> maps).
> 


For the set:
Reviewed-by: David Ahern <dsahern@gmail.com>



^ permalink raw reply

* [PATCH iproute2] ss: Render buffer to output every time a number of chunks are allocated
From: Stefano Brivio @ 2019-02-14  0:58 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Phil Sutter, David Ahern, Sabrina Dubroca, netdev

Eric reported that, with 10 million sockets, ss -emoi (about 1000 bytes
output per socket) can easily lead to OOM (buffer would grow to 10GB of
memory).

Limit the maximum size of the buffer to five chunks, 1M each. Render and
flush buffers whenever we reach that.

This might make the resulting blocks slightly unaligned between them, with
occasional loss of readability on lines occurring every 5k to 50k sockets
approximately. Something like (from ss -tu):

[...]
CLOSE-WAIT   32       0           192.168.1.50:35232           10.0.0.1:https
ESTAB        0        0           192.168.1.50:53820           10.0.0.1:https
ESTAB       0        0           192.168.1.50:46924            10.0.0.1:https
CLOSE-WAIT  32       0           192.168.1.50:35228            10.0.0.1:https
[...]

However, I don't actually expect any human user to scroll through that
number of sockets, so readability should be preserved when it matters.

The bulk of the diffstat comes from moving field_next() around, as we now
call render() from it. Functionally, this is implemented by six lines of
code, most of them in field_next().
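
Stripped of the ss specifics, the bounding logic can be sketched as below.
This is a simplified model under stated assumptions, not the code in the
diff: struct out_buf, buf_add, line_end and buf_render are hypothetical
names, chunks are only counted rather than allocated, and CHUNK_SIZE is
tiny so the behaviour is easy to follow.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK_SIZE 8	/* tiny on purpose; the patch uses 1M chunks */
#define CHUNKS_MAX 5	/* render once this many chunks are in use */

struct out_buf {
	size_t chunks;	/* chunks currently in use */
	size_t used;	/* bytes used in the current chunk */
	size_t flushes;	/* number of times the buffer was rendered */
};

/* Model of rendering: output would be printed and chunks freed here */
static void buf_render(struct out_buf *b)
{
	b->chunks = 0;
	b->used = 0;
	b->flushes++;
}

/* Account for len bytes of field data, taking a new chunk if needed */
static void buf_add(struct out_buf *b, size_t len)
{
	assert(len <= CHUNK_SIZE);

	if (b->chunks == 0 || b->used + len > CHUNK_SIZE) {
		b->chunks++;
		b->used = 0;
	}
	b->used += len;
}

/* Called after the last field of a line: render only at line ends */
static void line_end(struct out_buf *b)
{
	if (b->chunks >= CHUNKS_MAX)
		buf_render(b);
}
```

Rendering only at the end of a line keeps each line whole in the output,
at the cost of column widths being computed per rendered block rather than
over the whole dump.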

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 691bd854bf4a ("ss: Buffer raw fields first, then render them as a table")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
---
Eric, it would be nice if you could test this with your bazillion sockets,
I checked this with -emoi and "only" 500,000 sockets.

 misc/ss.c | 68 ++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 40 insertions(+), 28 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 9e821faf0d31..28bdcba72d73 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -52,7 +52,8 @@
 #include <linux/tipc_sockets_diag.h>
 
 #define MAGIC_SEQ 123456
-#define BUF_CHUNK (1024 * 1024)
+#define BUF_CHUNK (1024 * 1024)	/* Buffer chunk allocation size */
+#define BUF_CHUNKS_MAX 5	/* Maximum number of allocated buffer chunks */
 #define LEN_ALIGN(x) (((x) + 1) & ~1)
 
 #define DIAG_REQUEST(_req, _r)						    \
@@ -176,6 +177,7 @@ static struct {
 	struct buf_token *cur;	/* Position of current token in chunk */
 	struct buf_chunk *head;	/* First chunk */
 	struct buf_chunk *tail;	/* Current chunk */
+	int chunks;		/* Number of allocated chunks */
 } buffer;
 
 static const char *TCP_PROTO = "tcp";
@@ -936,6 +938,8 @@ static struct buf_chunk *buf_chunk_new(void)
 
 	new->end = buffer.cur->data;
 
+	buffer.chunks++;
+
 	return new;
 }
 
@@ -1080,33 +1084,6 @@ static int field_is_last(struct column *f)
 	return f - columns == COL_MAX - 1;
 }
 
-static void field_next(void)
-{
-	field_flush(current_field);
-
-	if (field_is_last(current_field))
-		current_field = columns;
-	else
-		current_field++;
-}
-
-/* Walk through fields and flush them until we reach the desired one */
-static void field_set(enum col_id id)
-{
-	while (id != current_field - columns)
-		field_next();
-}
-
-/* Print header for all non-empty columns */
-static void print_header(void)
-{
-	while (!field_is_last(current_field)) {
-		if (!current_field->disabled)
-			out("%s", current_field->header);
-		field_next();
-	}
-}
-
 /* Get the next available token in the buffer starting from the current token */
 static struct buf_token *buf_token_next(struct buf_token *cur)
 {
@@ -1132,6 +1109,7 @@ static void buf_free_all(void)
 		free(tmp);
 	}
 	buffer.head = NULL;
+	buffer.chunks = 0;
 }
 
 /* Get current screen width, default to 80 columns if TIOCGWINSZ fails */
@@ -1294,6 +1272,40 @@ static void render(void)
 	current_field = columns;
 }
 
+/* Move to next field, and render buffer if we reached the maximum number of
+ * chunks, at the last field in a line.
+ */
+static void field_next(void)
+{
+	if (field_is_last(current_field) && buffer.chunks >= BUF_CHUNKS_MAX) {
+		render();
+		return;
+	}
+
+	field_flush(current_field);
+	if (field_is_last(current_field))
+		current_field = columns;
+	else
+		current_field++;
+}
+
+/* Walk through fields and flush them until we reach the desired one */
+static void field_set(enum col_id id)
+{
+	while (id != current_field - columns)
+		field_next();
+}
+
+/* Print header for all non-empty columns */
+static void print_header(void)
+{
+	while (!field_is_last(current_field)) {
+		if (!current_field->disabled)
+			out("%s", current_field->header);
+		field_next();
+	}
+}
+
 static void sock_state_print(struct sockstat *s)
 {
 	const char *sock_name;
-- 
2.20.1


^ permalink raw reply related

* [PATCH] tcp: Namespace-ify sysctl_tcp_rmem and sysctl_tcp_wmem
From: Alakesh Haloi @ 2019-02-14  1:12 UTC (permalink / raw)
  To: stable
  Cc: David S. Miller, Alexey Kuznetsov, Hideaki YOSHIFUJI,
	Eric Dumazet, netdev, linux-kernel

[ Upstream commit 356d1833b638bd465672aefeb71def3ab93fc17d ]

Note that when a new netns is created, it inherits its
sysctl_tcp_rmem and sysctl_tcp_wmem from initial netns.

This change is needed so that we can refine TCP rcvbuf autotuning,
to take RTT into consideration.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[alakeshh: backport to v4.14: The patch does not apply to v4.14
directly and hence needed manual backport. Function signature for
the function tcp_select_initial_window had to be changed to be able
to pass pointer to struct sock.]
Signed-off-by: Alakesh Haloi <alakeshh@amazon.com>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: <stable@vger.kernel.org> # 4.14.x
---
 include/net/netns/ipv4.h   |  2 ++
 include/net/sock.h         |  3 +++
 include/net/tcp.h          |  5 ++---
 net/ipv4/syncookies.c      |  2 +-
 net/ipv4/sysctl_net_ipv4.c | 34 +++++++++++++++++-----------------
 net/ipv4/tcp.c             | 21 ++++++++-------------
 net/ipv4/tcp_input.c       | 17 +++++++++++------
 net/ipv4/tcp_ipv4.c        | 12 ++++++++++--
 net/ipv4/tcp_minisocks.c   |  2 +-
 net/ipv4/tcp_output.c      |  7 ++++---
 net/ipv6/syncookies.c      |  2 +-
 net/ipv6/tcp_ipv6.c        |  4 ++--
 12 files changed, 62 insertions(+), 49 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8fcff2837484..ea48e5b8dbda 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -126,6 +126,8 @@ struct netns_ipv4 {
 	int sysctl_tcp_sack;
 	int sysctl_tcp_window_scaling;
 	int sysctl_tcp_timestamps;
+	int sysctl_tcp_wmem[3];
+	int sysctl_tcp_rmem[3];
 	struct inet_timewait_death_row tcp_death_row;
 	int sysctl_max_syn_backlog;
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 4280e96d4b46..cec9b63a482a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1095,8 +1095,11 @@ struct proto {
 	 */
 	unsigned long		*memory_pressure;
 	long			*sysctl_mem;
+
 	int			*sysctl_wmem;
 	int			*sysctl_rmem;
+	u32                     sysctl_wmem_offset;
+	u32                     sysctl_rmem_offset;
 	int			max_header;
 	bool			no_autobind;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0c828aac7e04..a234f0d83184 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -251,8 +251,6 @@ extern int sysctl_tcp_reordering;
 extern int sysctl_tcp_max_reordering;
 extern int sysctl_tcp_dsack;
 extern long sysctl_tcp_mem[3];
-extern int sysctl_tcp_wmem[3];
-extern int sysctl_tcp_rmem[3];
 extern int sysctl_tcp_app_win;
 extern int sysctl_tcp_adv_win_scale;
 extern int sysctl_tcp_frto;
@@ -1322,7 +1320,8 @@ static inline void tcp_slow_start_after_idle_check(struct sock *sk)
 }
 
 /* Determine a window scaling and initial window to offer. */
-void tcp_select_initial_window(int __space, __u32 mss, __u32 *rcv_wnd,
+void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
+			       __u32 *rcv_wnd,
 			       __u32 *window_clamp, int wscale_ok,
 			       __u8 *rcv_wscale, __u32 init_rcv_wnd);
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 77cf32a80952..fda37f2862c9 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -385,7 +385,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	/* Try to redo what tcp_v4_send_synack did. */
 	req->rsk_window_clamp = tp->window_clamp ? :dst_metric(&rt->dst, RTAX_WINDOW);
 
-	tcp_select_initial_window(tcp_full_space(sk), req->mss,
+	tcp_select_initial_window(sk, tcp_full_space(sk), req->mss,
 				  &req->rsk_rcv_wnd, &req->rsk_window_clamp,
 				  ireq->wscale_ok, &rcv_wscale,
 				  dst_metric(&rt->dst, RTAX_INITRWND));
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index d82e8344fc54..0a518d3fdd5a 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -508,22 +508,6 @@ static struct ctl_table ipv4_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
 	},
-	{
-		.procname	= "tcp_wmem",
-		.data		= &sysctl_tcp_wmem,
-		.maxlen		= sizeof(sysctl_tcp_wmem),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &one,
-	},
-	{
-		.procname	= "tcp_rmem",
-		.data		= &sysctl_tcp_rmem,
-		.maxlen		= sizeof(sysctl_tcp_rmem),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &one,
-	},
 	{
 		.procname	= "tcp_app_win",
 		.data		= &sysctl_tcp_app_win,
@@ -1152,7 +1136,23 @@ static struct ctl_table ipv4_net_table[] = {
 		.data		= &init_net.ipv4.sysctl_tcp_timestamps,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "tcp_wmem",
+		.data		= &init_net.ipv4.sysctl_tcp_wmem,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_wmem),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
+	},
+	{
+		.procname	= "tcp_rmem",
+		.data		= &init_net.ipv4.sysctl_tcp_rmem,
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_rmem),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one,
 	},
 	{ }
 };
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index fd14501ac3af..57db728ec5f7 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -290,12 +290,8 @@ struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
 long sysctl_tcp_mem[3] __read_mostly;
-int sysctl_tcp_wmem[3] __read_mostly;
-int sysctl_tcp_rmem[3] __read_mostly;
 
 EXPORT_SYMBOL(sysctl_tcp_mem);
-EXPORT_SYMBOL(sysctl_tcp_rmem);
-EXPORT_SYMBOL(sysctl_tcp_wmem);
 
 atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
 EXPORT_SYMBOL(tcp_memory_allocated);
@@ -449,9 +445,8 @@ void tcp_init_sock(struct sock *sk)
 
 	icsk->icsk_sync_mss = tcp_sync_mss;
 
-	sk->sk_sndbuf = sysctl_tcp_wmem[1];
-	sk->sk_rcvbuf = sysctl_tcp_rmem[1];
-
+	sk->sk_sndbuf = sock_net(sk)->ipv4.sysctl_tcp_wmem[1];
+	sk->sk_rcvbuf = sock_net(sk)->ipv4.sysctl_tcp_rmem[1];
 	sk_sockets_allocated_inc(sk);
 }
 EXPORT_SYMBOL(tcp_init_sock);
@@ -3538,13 +3533,13 @@ void __init tcp_init(void)
 	max_wshare = min(4UL*1024*1024, limit);
 	max_rshare = min(6UL*1024*1024, limit);
 
-	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
-	sysctl_tcp_wmem[1] = 16*1024;
-	sysctl_tcp_wmem[2] = max(64*1024, max_wshare);
+	init_net.ipv4.sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
+	init_net.ipv4.sysctl_tcp_wmem[1] = 16 * 1024;
+	init_net.ipv4.sysctl_tcp_wmem[2] = max(64 * 1024, max_wshare);
 
-	sysctl_tcp_rmem[0] = SK_MEM_QUANTUM;
-	sysctl_tcp_rmem[1] = 87380;
-	sysctl_tcp_rmem[2] = max(87380, max_rshare);
+	init_net.ipv4.sysctl_tcp_rmem[0] = SK_MEM_QUANTUM;
+	init_net.ipv4.sysctl_tcp_rmem[1] = 87380;
+	init_net.ipv4.sysctl_tcp_rmem[2] = max(87380, max_rshare);
 
 	pr_info("Hash tables configured (established %u bind %u)\n",
 		tcp_hashinfo.ehash_mask + 1, tcp_hashinfo.bhash_size);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e24c0d7adf65..19b59488d4d5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -340,7 +340,8 @@ static void tcp_sndbuf_expand(struct sock *sk)
 	sndmem *= nr_segs * per_mss;
 
 	if (sk->sk_sndbuf < sndmem)
-		sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
+		sk->sk_sndbuf = min(sndmem,
+				    sock_net(sk)->ipv4.sysctl_tcp_wmem[2]);
 }
 
 /* 2. Tuning advertised window (window_clamp, rcv_ssthresh)
@@ -372,9 +373,10 @@ static void tcp_sndbuf_expand(struct sock *sk)
 static int __tcp_grow_window(const struct sock *sk, const struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
 	/* Optimize this! */
 	int truesize = tcp_win_from_space(skb->truesize) >> 1;
-	int window = tcp_win_from_space(sysctl_tcp_rmem[2]) >> 1;
+	int window = tcp_win_from_space(net->ipv4.sysctl_tcp_rmem[2]) >> 1;
 
 	while (tp->rcv_ssthresh <= window) {
 		if (truesize <= skb->len)
@@ -417,6 +419,7 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 static void tcp_fixup_rcvbuf(struct sock *sk)
 {
 	u32 mss = tcp_sk(sk)->advmss;
+	struct net *net = sock_net(sk);
 	int rcvmem;
 
 	rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
@@ -429,7 +432,7 @@ static void tcp_fixup_rcvbuf(struct sock *sk)
 		rcvmem <<= 2;
 
 	if (sk->sk_rcvbuf < rcvmem)
-		sk->sk_rcvbuf = min(rcvmem, sysctl_tcp_rmem[2]);
+		sk->sk_rcvbuf = min(rcvmem, net->ipv4.sysctl_tcp_rmem[2]);
 }
 
 /* 4. Try to fixup all. It is made immediately after connection enters
@@ -476,15 +479,16 @@ static void tcp_clamp_window(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct inet_connection_sock *icsk = inet_csk(sk);
+	struct net *net = sock_net(sk);
 
 	icsk->icsk_ack.quick = 0;
 
-	if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
+	if (sk->sk_rcvbuf < net->ipv4.sysctl_tcp_rmem[2] &&
 	    !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
 	    !tcp_under_memory_pressure(sk) &&
 	    sk_memory_allocated(sk) < sk_prot_mem_limits(sk, 0)) {
 		sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
-				    sysctl_tcp_rmem[2]);
+				    net->ipv4.sysctl_tcp_rmem[2]);
 	}
 	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
 		tp->rcv_ssthresh = min(tp->window_clamp, 2U * tp->advmss);
@@ -647,7 +651,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
 			rcvmem += 128;
 
 		do_div(rcvwin, tp->advmss);
-		rcvbuf = min_t(u64, rcvwin * rcvmem, sysctl_tcp_rmem[2]);
+		rcvbuf = min_t(u64, rcvwin * rcvmem,
+			       sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
 		if (rcvbuf > sk->sk_rcvbuf) {
 			sk->sk_rcvbuf = rcvbuf;
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 31b34c0c2d5f..ae7409861b7d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2428,8 +2428,8 @@ struct proto tcp_prot = {
 	.memory_allocated	= &tcp_memory_allocated,
 	.memory_pressure	= &tcp_memory_pressure,
 	.sysctl_mem		= sysctl_tcp_mem,
-	.sysctl_wmem		= sysctl_tcp_wmem,
-	.sysctl_rmem		= sysctl_tcp_rmem,
+	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
+	.sysctl_rmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_rmem),
 	.max_header		= MAX_TCP_HEADER,
 	.obj_size		= sizeof(struct tcp_sock),
 	.slab_flags		= SLAB_TYPESAFE_BY_RCU,
@@ -2509,6 +2509,14 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_sack = 1;
 	net->ipv4.sysctl_tcp_window_scaling = 1;
 	net->ipv4.sysctl_tcp_timestamps = 1;
+	if (net != &init_net) {
+		memcpy(net->ipv4.sysctl_tcp_rmem,
+		       init_net.ipv4.sysctl_tcp_rmem,
+		       sizeof(init_net.ipv4.sysctl_tcp_rmem));
+		memcpy(net->ipv4.sysctl_tcp_wmem,
+		       init_net.ipv4.sysctl_tcp_wmem,
+		       sizeof(init_net.ipv4.sysctl_tcp_wmem));
+	}
 
 	return 0;
 fail:
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 61584638dba7..e50139d51ed2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -378,7 +378,7 @@ void tcp_openreq_init_rwin(struct request_sock *req,
 		full_space = rcv_wnd * mss;
 
 	/* tcp_full_space because it is guaranteed to be the first packet */
-	tcp_select_initial_window(full_space,
+	tcp_select_initial_window(sk_listener, full_space,
 		mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
 		&req->rsk_rcv_wnd,
 		&req->rsk_window_clamp,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 24bad638c2ec..a87d44a80c7d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -208,12 +208,13 @@ u32 tcp_default_init_rwnd(u32 mss)
  * be a multiple of mss if possible. We assume here that mss >= 1.
  * This MUST be enforced by all callers.
  */
-void tcp_select_initial_window(int __space, __u32 mss,
+void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 			       __u32 *rcv_wnd, __u32 *window_clamp,
 			       int wscale_ok, __u8 *rcv_wscale,
 			       __u32 init_rcv_wnd)
 {
 	unsigned int space = (__space < 0 ? 0 : __space);
+	struct net *net = sock_net(sk);
 
 	/* If no clamp set the clamp to the max possible scaled window */
 	if (*window_clamp == 0)
@@ -240,7 +241,7 @@ void tcp_select_initial_window(int __space, __u32 mss,
 	(*rcv_wscale) = 0;
 	if (wscale_ok) {
 		/* Set window scaling on max possible window */
-		space = max_t(u32, space, sysctl_tcp_rmem[2]);
+		space = max_t(u32, space, net->ipv4.sysctl_tcp_rmem[2]);
 		space = max_t(u32, space, sysctl_rmem_max);
 		space = min_t(u32, space, *window_clamp);
 		while (space > U16_MAX && (*rcv_wscale) < TCP_MAX_WSCALE) {
@@ -3331,7 +3332,7 @@ static void tcp_connect_init(struct sock *sk)
 	if (rcv_wnd == 0)
 		rcv_wnd = dst_metric(dst, RTAX_INITRWND);
 
-	tcp_select_initial_window(tcp_full_space(sk),
+	tcp_select_initial_window(sk, tcp_full_space(sk),
 				  tp->advmss - (tp->rx_opt.ts_recent_stamp ? tp->tcp_header_len - sizeof(struct tcphdr) : 0),
 				  &tp->rcv_wnd,
 				  &tp->window_clamp,
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 4e7817abc0b9..e7a3a6b6cf56 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -244,7 +244,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	}
 
 	req->rsk_window_clamp = tp->window_clamp ? :dst_metric(dst, RTAX_WINDOW);
-	tcp_select_initial_window(tcp_full_space(sk), req->mss,
+	tcp_select_initial_window(sk, tcp_full_space(sk), req->mss,
 				  &req->rsk_rcv_wnd, &req->rsk_window_clamp,
 				  ireq->wscale_ok, &rcv_wscale,
 				  dst_metric(dst, RTAX_INITRWND));
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index ba8586aadffa..de89bcee62d7 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1940,8 +1940,8 @@ struct proto tcpv6_prot = {
 	.memory_pressure	= &tcp_memory_pressure,
 	.orphan_count		= &tcp_orphan_count,
 	.sysctl_mem		= sysctl_tcp_mem,
-	.sysctl_wmem		= sysctl_tcp_wmem,
-	.sysctl_rmem		= sysctl_tcp_rmem,
+	.sysctl_wmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_wmem),
+	.sysctl_rmem_offset	= offsetof(struct net, ipv4.sysctl_tcp_rmem),
 	.max_header		= MAX_TCP_HEADER,
 	.obj_size		= sizeof(struct tcp6_sock),
 	.slab_flags		= SLAB_TYPESAFE_BY_RCU,
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net] net: phy: fix interrupt handling in non-started states
From: Andrew Lunn @ 2019-02-14  1:17 UTC (permalink / raw)
  To: Heiner Kallweit
  Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org,
	Russell King - ARM Linux
In-Reply-To: <25e86edc-0b88-8c03-b692-776e971331f2@gmail.com>

On Tue, Feb 12, 2019 at 07:56:15PM +0100, Heiner Kallweit wrote:
> phylib enables interrupts before phy_start() has been called, and if
> we receive an interrupt in a non-started state, the interrupt handler
> returns IRQ_NONE. This causes problems with at least one Marvell chip
> as reported by Andrew.
> Fix this by handling interrupts the same as in phy_mac_interrupt(),
> basically always running the phylib state machine. It knows when it
> has to do something and when not.
> This change allows interrupts to be handled gracefully even if they
> occur in a non-started state.
> 
> Fixes: 2b3e88ea6528 ("net: phy: improve phy state checking")
> Reported-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew


* Re: [PATCH net] dsa: mv88e6xxx: Ensure all pending interrupts are handled prior to exit
From: Andrew Lunn @ 2019-02-14  2:07 UTC (permalink / raw)
  To: John David Anglin; +Cc: Russell King, Vivien Didelot, Florian Fainelli, netdev
In-Reply-To: <6a1ebc61-3505-beb8-21cb-ea42ad9fe67e@bell.net>

On Mon, Feb 11, 2019 at 01:40:21PM -0500, John David Anglin wrote:
> The GPIO interrupt controller on the espressobin board only supports edge interrupts.
> If one enables the use of hardware interrupts in the device tree for the 88E6341, it is
> possible to miss an edge.  When this happens, the INTn pin on the Marvell switch is
> stuck low and no further interrupts occur.
> 
> I found after adding debug statements to mv88e6xxx_g1_irq_thread_work() that there is
> a race in handling device interrupts (e.g. PHY link interrupts).  Some interrupts are
> directly cleared by reading the Global 1 status register.  However, the device interrupt
> flag, for example, is not cleared until all the unmasked SERDES and PHY ports are serviced.
> This is done by reading the relevant SERDES and PHY status register.
> 
> The code only services interrupts whose status bit is set at the time of reading its status
> register.  If an interrupt event occurs after its status is read and before all interrupts
> are serviced, then this event will not be serviced and the INTn output pin will remain low.
> 
> This is not a problem with polling or level interrupts since the handler will be called
> again to process the event.  However, it's a big problem when using edge interrupts.
> 
> The fix presented here is to add a loop around the code servicing switch interrupts.  If
> any pending interrupts remain after the current set has been handled, we loop and process
> the new set.  If there are no pending interrupts after servicing, we are sure that INTn has
> gone high and we will get an edge when a new event occurs.
> 
> Tested on espressobin board.
> 
> Signed-off-by:  John David Anglin <dave.anglin@bell.net>

Fixes: dc30c35be720 ("net: dsa: mv88e6xxx: Implement interrupt support.")

Tested-by: Andrew Lunn <andrew@lunn.ch>

David, please ensure that Heiner's patch:

net: phy: fix interrupt handling in non-started states

is applied first. Otherwise we can get into an interrupt storm.

    Andrew
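
The loop described in the quoted commit message can be illustrated with a
small user-space model. This is only a sketch under stated assumptions, not
the driver code: struct fake_switch, read_status(), service() and both
handlers are hypothetical names, and the "late event" field simulates an
interrupt that arrives after the status register was read.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the switch: not the mv88e6xxx data structures. */
struct fake_switch {
	uint32_t pending;	/* status bits currently asserted (INTn low if != 0) */
	uint32_t late_events;	/* bits that fire during the first servicing pass */
	unsigned int serviced;	/* number of events serviced so far */
};

static uint32_t read_status(const struct fake_switch *sw)
{
	return sw->pending;
}

/* Service only the bits that were set when the status was read. */
static void service(struct fake_switch *sw, uint32_t status)
{
	assert(sw != 0);
	for (int bit = 0; bit < 32; bit++) {
		if (!(status & (UINT32_C(1) << bit)))
			continue;
		sw->pending &= ~(UINT32_C(1) << bit);
		sw->serviced++;
	}
	/* Simulated race: a new event arrives after the status read. */
	sw->pending |= sw->late_events;
	sw->late_events = 0;
}

/* Single pass: a late event stays pending, so no new edge is generated. */
static void handle_once(struct fake_switch *sw)
{
	uint32_t status = read_status(sw);

	if (status)
		service(sw, status);
}

/* Looping variant: return only once nothing is pending any more. */
static unsigned int handle_until_idle(struct fake_switch *sw)
{
	unsigned int passes = 0;
	uint32_t status;

	while ((status = read_status(sw)) != 0) {
		service(sw, status);
		passes++;
	}
	return passes;
}
```

In this model, handle_once() leaves the late event pending (the stuck-low
case on an edge-triggered controller), while handle_until_idle() takes a
second pass and exits with nothing pending.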


* [PATCH net-next v2] bonding: check slave set command firstly
From: xiangxia.m.yue @ 2019-02-11 18:49 UTC (permalink / raw)
  To: davem; +Cc: netdev, Tonghao Zhang

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

This patch is a small improvement. If the user runs the
command shown below, we should print message [1]
instead of [2]. eth0 actually exists, so message [2] may
confuse the user.

$ echo "eth0" > /sys/class/net/bond4/bonding/slaves

[1] "bond4: no command found in slaves file - use +ifname or -ifname"
[2] "write error: No such device"
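
The check added by this patch can be sketched in user space as follows. This
is an illustration only: parse_slave_cmd() and enum slave_cmd are
hypothetical names, and the kernel's dev_valid_name() check is left out
because it is not available outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result type, not part of the bonding driver. */
enum slave_cmd { SLAVE_CMD_INVALID, SLAVE_CMD_ADD, SLAVE_CMD_DEL };

/*
 * Accept only "+ifname" or "-ifname". Anything else, including a bare
 * "eth0", is rejected up front, which corresponds to message [1].
 */
static enum slave_cmd parse_slave_cmd(const char *command, const char **ifname)
{
	assert(command != NULL && ifname != NULL);

	*ifname = NULL;
	if (strlen(command) <= 1)
		return SLAVE_CMD_INVALID;
	if (command[0] != '+' && command[0] != '-')
		return SLAVE_CMD_INVALID;

	*ifname = command + 1;	/* name starts after the prefix character */
	return command[0] == '+' ? SLAVE_CMD_ADD : SLAVE_CMD_DEL;
}
```

With this shape, "eth0" is rejected before any device lookup happens, so the
caller can report the missing prefix rather than a missing device.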

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
v2: fix compiling warning.
---
 drivers/net/bonding/bond_options.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index 4d5d01c..da1fc17 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -1375,6 +1375,7 @@ static int bond_option_slaves_set(struct bonding *bond,
 	sscanf(newval->string, "%16s", command); /* IFNAMSIZ*/
 	ifname = command + 1;
 	if ((strlen(command) <= 1) ||
+	    (command[0] != '+' && command[0] != '-') ||
 	    !dev_valid_name(ifname))
 		goto err_no_cmd;
 
@@ -1398,6 +1399,7 @@ static int bond_option_slaves_set(struct bonding *bond,
 		break;
 
 	default:
+		/* should not run here. */
 		goto err_no_cmd;
 	}
 
-- 
1.8.3.1



* Re: [PATCH bpf-next v11 0/7] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap
From: Alexei Starovoitov @ 2019-02-14  2:39 UTC (permalink / raw)
  To: David Ahern
  Cc: Peter Oskolkov, Alexei Starovoitov, Daniel Borkmann, netdev,
	Peter Oskolkov, Willem de Bruijn
In-Reply-To: <783b5578-cba4-904d-4ade-c8c08b47a3ba@gmail.com>

On Wed, Feb 13, 2019 at 05:46:26PM -0700, David Ahern wrote:
> On 2/13/19 12:53 PM, Peter Oskolkov wrote:
> > This patchset implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
> > BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
> > and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
> > to packets (e.g. IP/GRE, GUE, IPIP).
> > 
> > This is useful when thousands of different short-lived flows should be
> > encapped, each with different and dynamically determined destination.
> > Although lwtunnels can be used in some of these scenarios, the ability
> > to dynamically generate encap headers adds more flexibility, e.g.
> > when routing depends on the state of the host (reflected in global bpf
> > maps).
> > 
> 
> 
> For the set:
> Reviewed-by: David Ahern <dsahern@gmail.com>

Applied. Thanks everyone!



* [PATCH net-next v5] ipmr: ip6mr: Create new sockopt to clear mfc cache or vifs
From: Callum Sinclair @ 2019-02-14  2:44 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, nikolay, netdev, linux-kernel
  Cc: nicolas.dichtel, Callum Sinclair
In-Reply-To: <20190214024418.21490-1-callum.sinclair@alliedtelesis.co.nz>

Currently the only way to clear the forwarding cache is to delete the
entries one by one using the MRT_DEL_MFC socket option or to destroy and
recreate the socket.

Create a new socket option which with the use of optional flags can
clear any combination of multicast entries (static or not static) and
multicast vifs (static or not static).

Calling the new socket option MRT_FLUSH with the flags MRT_FLUSH_MFC and
MRT_FLUSH_VIFS will clear all entries and vifs on the socket except for
static entries.

Signed-off-by: Callum Sinclair <callum.sinclair@alliedtelesis.co.nz>
---
v1 -> v2:
  Implemented additional flags for static entries
v2 -> v3:
  Cleaned up flag logic so any combination of routes can be cleared.
  Fixed style errors
  Fixed incorrect flag values
v3 -> v4:
  Fixed style errors
  Fixed incorrect flag (MRT_FLUSH was used instead of MRT_FLUSH_VIFS)
v4 -> v5:
  Only clear the unresolved queue when MRT_FLUSH_MFC flag is set.

 include/uapi/linux/mroute.h  |  9 ++++-
 include/uapi/linux/mroute6.h |  9 ++++-
 net/ipv4/ipmr.c              | 75 +++++++++++++++++++++-------------
 net/ipv6/ip6mr.c             | 78 +++++++++++++++++++++++-------------
 4 files changed, 115 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/mroute.h b/include/uapi/linux/mroute.h
index 5d37a9ccce63..11c8c1fc1124 100644
--- a/include/uapi/linux/mroute.h
+++ b/include/uapi/linux/mroute.h
@@ -28,12 +28,19 @@
 #define MRT_TABLE	(MRT_BASE+9)	/* Specify mroute table ID		*/
 #define MRT_ADD_MFC_PROXY	(MRT_BASE+10)	/* Add a (*,*|G) mfc entry	*/
 #define MRT_DEL_MFC_PROXY	(MRT_BASE+11)	/* Del a (*,*|G) mfc entry	*/
-#define MRT_MAX		(MRT_BASE+11)
+#define MRT_FLUSH	(MRT_BASE+12)	/* Flush all mfc entries and/or vifs	*/
+#define MRT_MAX		(MRT_BASE+12)
 
 #define SIOCGETVIFCNT	SIOCPROTOPRIVATE	/* IP protocol privates */
 #define SIOCGETSGCNT	(SIOCPROTOPRIVATE+1)
 #define SIOCGETRPF	(SIOCPROTOPRIVATE+2)
 
+/* MRT_FLUSH optional flags */
+#define MRT_FLUSH_MFC	1	/* Flush multicast entries */
+#define MRT_FLUSH_MFC_STATIC	2	/* Flush static multicast entries */
+#define MRT_FLUSH_VIFS	4	/* Flush multicast vifs */
+#define MRT_FLUSH_VIFS_STATIC	8	/* Flush static multicast vifs */
+
 #define MAXVIFS		32
 typedef unsigned long vifbitmap_t;	/* User mode code depends on this lot */
 typedef unsigned short vifi_t;
diff --git a/include/uapi/linux/mroute6.h b/include/uapi/linux/mroute6.h
index 9999cc006390..ac84ef11b29c 100644
--- a/include/uapi/linux/mroute6.h
+++ b/include/uapi/linux/mroute6.h
@@ -31,12 +31,19 @@
 #define MRT6_TABLE	(MRT6_BASE+9)	/* Specify mroute table ID		*/
 #define MRT6_ADD_MFC_PROXY	(MRT6_BASE+10)	/* Add a (*,*|G) mfc entry	*/
 #define MRT6_DEL_MFC_PROXY	(MRT6_BASE+11)	/* Del a (*,*|G) mfc entry	*/
-#define MRT6_MAX	(MRT6_BASE+11)
+#define MRT6_FLUSH	(MRT6_BASE+12)	/* Flush all mfc entries and/or vifs	*/
+#define MRT6_MAX	(MRT6_BASE+12)
 
 #define SIOCGETMIFCNT_IN6	SIOCPROTOPRIVATE	/* IP protocol privates */
 #define SIOCGETSGCNT_IN6	(SIOCPROTOPRIVATE+1)
 #define SIOCGETRPF	(SIOCPROTOPRIVATE+2)
 
+/* MRT6_FLUSH optional flags */
+#define MRT6_FLUSH_MFC	1	/* Flush multicast entries */
+#define MRT6_FLUSH_MFC_STATIC	2	/* Flush static multicast entries */
+#define MRT6_FLUSH_VIFS	4	/* Flushing multicast vifs */
+#define MRT6_FLUSH_VIFS_STATIC	8	/* Flush static multicast vifs */
+
 #define MAXMIFS		32
 typedef unsigned long mifbitmap_t;	/* User mode code depends on this lot */
 typedef unsigned short mifi_t;
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index e536970557dd..53869779af74 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -110,7 +110,7 @@ static int ipmr_cache_report(struct mr_table *mrt,
 static void mroute_netlink_event(struct mr_table *mrt, struct mfc_cache *mfc,
 				 int cmd);
 static void igmpmsg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt);
-static void mroute_clean_tables(struct mr_table *mrt, bool all);
+static void mroute_clean_tables(struct mr_table *mrt, int flags);
 static void ipmr_expire_process(struct timer_list *t);
 
 #ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
@@ -415,7 +415,8 @@ static struct mr_table *ipmr_new_table(struct net *net, u32 id)
 static void ipmr_free_table(struct mr_table *mrt)
 {
 	del_timer_sync(&mrt->ipmr_expire_timer);
-	mroute_clean_tables(mrt, true);
+	mroute_clean_tables(mrt, MRT_FLUSH_VIFS | MRT_FLUSH_VIFS_STATIC |
+					  MRT_FLUSH_MFC | MRT_FLUSH_MFC_STATIC);
 	rhltable_destroy(&mrt->mfc_hash);
 	kfree(mrt);
 }
@@ -1296,7 +1297,7 @@ static int ipmr_mfc_add(struct net *net, struct mr_table *mrt,
 }
 
 /* Close the multicast socket, and clear the vif tables etc */
-static void mroute_clean_tables(struct mr_table *mrt, bool all)
+static void mroute_clean_tables(struct mr_table *mrt, int flags)
 {
 	struct net *net = read_pnet(&mrt->net);
 	struct mr_mfc *c, *tmp;
@@ -1305,35 +1306,44 @@ static void mroute_clean_tables(struct mr_table *mrt, bool all)
 	int i;
 
 	/* Shut down all active vif entries */
-	for (i = 0; i < mrt->maxvif; i++) {
-		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
-			continue;
-		vif_delete(mrt, i, 0, &list);
+	if (flags & (MRT_FLUSH_VIFS | MRT_FLUSH_VIFS_STATIC)) {
+		for (i = 0; i < mrt->maxvif; i++) {
+			if (((mrt->vif_table[i].flags & VIFF_STATIC) &&
+			     !(flags & MRT_FLUSH_VIFS_STATIC)) ||
+			    (!(mrt->vif_table[i].flags & VIFF_STATIC) && !(flags & MRT_FLUSH_VIFS)))
+				continue;
+			vif_delete(mrt, i, 0, &list);
+		}
+		unregister_netdevice_many(&list);
 	}
-	unregister_netdevice_many(&list);
 
 	/* Wipe the cache */
-	list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
-		if (!all && (c->mfc_flags & MFC_STATIC))
-			continue;
-		rhltable_remove(&mrt->mfc_hash, &c->mnode, ipmr_rht_params);
-		list_del_rcu(&c->list);
-		cache = (struct mfc_cache *)c;
-		call_ipmr_mfc_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, cache,
-					      mrt->id);
-		mroute_netlink_event(mrt, cache, RTM_DELROUTE);
-		mr_cache_put(c);
-	}
-
-	if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
-		spin_lock_bh(&mfc_unres_lock);
-		list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
-			list_del(&c->list);
+	if (flags & (MRT_FLUSH_MFC | MRT_FLUSH_MFC_STATIC)) {
+		list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
+			if (((c->mfc_flags & MFC_STATIC) && !(flags & MRT_FLUSH_MFC_STATIC)) ||
+			    (!(c->mfc_flags & MFC_STATIC) && !(flags & MRT_FLUSH_MFC)))
+				continue;
+			rhltable_remove(&mrt->mfc_hash, &c->mnode, ipmr_rht_params);
+			list_del_rcu(&c->list);
 			cache = (struct mfc_cache *)c;
+			call_ipmr_mfc_entry_notifiers(net, FIB_EVENT_ENTRY_DEL, cache,
+						      mrt->id);
 			mroute_netlink_event(mrt, cache, RTM_DELROUTE);
-			ipmr_destroy_unres(mrt, cache);
+			mr_cache_put(c);
+		}
+	}
+
+	if (flags & MRT_FLUSH_MFC) {
+		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
+			spin_lock_bh(&mfc_unres_lock);
+			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
+				list_del(&c->list);
+				cache = (struct mfc_cache *)c;
+				mroute_netlink_event(mrt, cache, RTM_DELROUTE);
+				ipmr_destroy_unres(mrt, cache);
+			}
+			spin_unlock_bh(&mfc_unres_lock);
 		}
-		spin_unlock_bh(&mfc_unres_lock);
 	}
 }
 
@@ -1354,7 +1364,7 @@ static void mrtsock_destruct(struct sock *sk)
 						    NETCONFA_IFINDEX_ALL,
 						    net->ipv4.devconf_all);
 			RCU_INIT_POINTER(mrt->mroute_sk, NULL);
-			mroute_clean_tables(mrt, false);
+			mroute_clean_tables(mrt, MRT_FLUSH_VIFS | MRT_FLUSH_MFC);
 		}
 	}
 	rtnl_unlock();
@@ -1479,6 +1489,17 @@ int ip_mroute_setsockopt(struct sock *sk, int optname, char __user *optval,
 					   sk == rtnl_dereference(mrt->mroute_sk),
 					   parent);
 		break;
+	case MRT_FLUSH:
+		if (optlen != sizeof(val)) {
+			ret = -EINVAL;
+			break;
+		}
+		if (get_user(val, (int __user *)optval)) {
+			ret = -EFAULT;
+			break;
+		}
+		mroute_clean_tables(mrt, val);
+		break;
 	/* Control PIM assert. */
 	case MRT_ASSERT:
 		if (optlen != sizeof(val)) {
diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index cc01aa3f2b5e..b67a7c1e3615 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -97,7 +97,7 @@ static void mr6_netlink_event(struct mr_table *mrt, struct mfc6_cache *mfc,
 static void mrt6msg_netlink_event(struct mr_table *mrt, struct sk_buff *pkt);
 static int ip6mr_rtm_dumproute(struct sk_buff *skb,
 			       struct netlink_callback *cb);
-static void mroute_clean_tables(struct mr_table *mrt, bool all);
+static void mroute_clean_tables(struct mr_table *mrt, int flags);
 static void ipmr_expire_process(struct timer_list *t);
 
 #ifdef CONFIG_IPV6_MROUTE_MULTIPLE_TABLES
@@ -393,7 +393,8 @@ static struct mr_table *ip6mr_new_table(struct net *net, u32 id)
 static void ip6mr_free_table(struct mr_table *mrt)
 {
 	del_timer_sync(&mrt->ipmr_expire_timer);
-	mroute_clean_tables(mrt, true);
+	mroute_clean_tables(mrt, MRT6_FLUSH_VIFS | MRT6_FLUSH_VIFS_STATIC |
+					  MRT6_FLUSH_MFC | MRT6_FLUSH_MFC_STATIC);
 	rhltable_destroy(&mrt->mfc_hash);
 	kfree(mrt);
 }
@@ -1496,42 +1497,51 @@ static int ip6mr_mfc_add(struct net *net, struct mr_table *mrt,
  *	Close the multicast socket, and clear the vif tables etc
  */
 
-static void mroute_clean_tables(struct mr_table *mrt, bool all)
+static void mroute_clean_tables(struct mr_table *mrt, int flags)
 {
 	struct mr_mfc *c, *tmp;
 	LIST_HEAD(list);
 	int i;
 
 	/* Shut down all active vif entries */
-	for (i = 0; i < mrt->maxvif; i++) {
-		if (!all && (mrt->vif_table[i].flags & VIFF_STATIC))
-			continue;
-		mif6_delete(mrt, i, 0, &list);
+	if (flags & (MRT6_FLUSH_VIFS | MRT6_FLUSH_VIFS_STATIC)) {
+		for (i = 0; i < mrt->maxvif; i++) {
+			if (((mrt->vif_table[i].flags & VIFF_STATIC) &&
+			     !(flags & MRT6_FLUSH_VIFS_STATIC)) ||
+			    (!(mrt->vif_table[i].flags & VIFF_STATIC) && !(flags & MRT6_FLUSH_VIFS)))
+				continue;
+			mif6_delete(mrt, i, 0, &list);
+		}
+		unregister_netdevice_many(&list);
 	}
-	unregister_netdevice_many(&list);
 
 	/* Wipe the cache */
-	list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
-		if (!all && (c->mfc_flags & MFC_STATIC))
-			continue;
-		rhltable_remove(&mrt->mfc_hash, &c->mnode, ip6mr_rht_params);
-		list_del_rcu(&c->list);
-		call_ip6mr_mfc_entry_notifiers(read_pnet(&mrt->net),
-					       FIB_EVENT_ENTRY_DEL,
-					       (struct mfc6_cache *)c, mrt->id);
-		mr6_netlink_event(mrt, (struct mfc6_cache *)c, RTM_DELROUTE);
-		mr_cache_put(c);
+	if (flags & (MRT6_FLUSH_MFC | MRT6_FLUSH_MFC_STATIC)) {
+		list_for_each_entry_safe(c, tmp, &mrt->mfc_cache_list, list) {
+			if (((c->mfc_flags & MFC_STATIC) && !(flags & MRT6_FLUSH_MFC_STATIC)) ||
+			    (!(c->mfc_flags & MFC_STATIC) && !(flags & MRT6_FLUSH_MFC)))
+				continue;
+			rhltable_remove(&mrt->mfc_hash, &c->mnode, ip6mr_rht_params);
+			list_del_rcu(&c->list);
+			call_ip6mr_mfc_entry_notifiers(read_pnet(&mrt->net),
+						       FIB_EVENT_ENTRY_DEL,
+										   (struct mfc6_cache *)c, mrt->id);
+			mr6_netlink_event(mrt, (struct mfc6_cache *)c, RTM_DELROUTE);
+			mr_cache_put(c);
+		}
 	}
 
-	if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
-		spin_lock_bh(&mfc_unres_lock);
-		list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
-			list_del(&c->list);
-			mr6_netlink_event(mrt, (struct mfc6_cache *)c,
-					  RTM_DELROUTE);
-			ip6mr_destroy_unres(mrt, (struct mfc6_cache *)c);
+	if (flags & MRT6_FLUSH_MFC) {
+		if (atomic_read(&mrt->cache_resolve_queue_len) != 0) {
+			spin_lock_bh(&mfc_unres_lock);
+			list_for_each_entry_safe(c, tmp, &mrt->mfc_unres_queue, list) {
+				list_del(&c->list);
+				mr6_netlink_event(mrt, (struct mfc6_cache *)c,
+						  RTM_DELROUTE);
+				ip6mr_destroy_unres(mrt, (struct mfc6_cache *)c);
+			}
+			spin_unlock_bh(&mfc_unres_lock);
 		}
-		spin_unlock_bh(&mfc_unres_lock);
 	}
 }
 
@@ -1587,7 +1597,7 @@ int ip6mr_sk_done(struct sock *sk)
 						     NETCONFA_IFINDEX_ALL,
 						     net->ipv6.devconf_all);
 
-			mroute_clean_tables(mrt, false);
+			mroute_clean_tables(mrt, MRT6_FLUSH_VIFS | MRT6_FLUSH_MFC);
 			err = 0;
 			break;
 		}
@@ -1703,6 +1713,20 @@ int ip6_mroute_setsockopt(struct sock *sk, int optname, char __user *optval, uns
 		rtnl_unlock();
 		return ret;
 
+	case MRT6_FLUSH:
+	{
+		int flags;
+
+		if (optlen != sizeof(flags))
+			return -EINVAL;
+		if (get_user(flags, (int __user *)optval))
+			return -EFAULT;
+		rtnl_lock();
+		mroute_clean_tables(mrt, flags);
+		rtnl_unlock();
+		return 0;
+	}
+
 	/*
 	 *	Control PIM assert (to activate pim will activate assert)
 	 */
-- 
2.20.1



* [PATCH net-next v5] ipmr: ip6mr: Create new sockopt to clear mfc cache or vifs
From: Callum Sinclair @ 2019-02-14  2:44 UTC (permalink / raw)
  To: davem, kuznet, yoshfuji, nikolay, netdev, linux-kernel
  Cc: nicolas.dichtel, Callum Sinclair

Created a way to clear the multicast forwarding cache on a socket
without having to either remove the entries manually using the delete
entry socket option or destroy and recreate the multicast socket.

The MRT_FLUSH socket option accepts any combination of the four flags
below, each selecting one class of state to clear.

MRT_FLUSH_MFC will clear all non-static mfc entries and the unresolved cache
MRT_FLUSH_MFC_STATIC will clear all static mfc entries
MRT_FLUSH_VIFS will clear all non-static interfaces
MRT_FLUSH_VIFS_STATIC will clear all static interfaces
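
The flag semantics listed above can be expressed as a small predicate. The
sketch below is a user-space illustration, not code from the patch: the flag
values are copied from the uapi header hunk, while should_flush(),
flush_mfc_entry(), flush_vif() and flush_unresolved() are hypothetical
helper names. The predicate is the logical inverse of the "continue"
condition used in mroute_clean_tables().

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in the patch's include/uapi/linux/mroute.h hunk. */
#define MRT_FLUSH_MFC		1
#define MRT_FLUSH_MFC_STATIC	2
#define MRT_FLUSH_VIFS		4
#define MRT_FLUSH_VIFS_STATIC	8

/*
 * An object is flushed if it is static and the "static" flag is set, or if
 * it is not static and the plain flag is set.
 */
static bool should_flush(bool is_static, int flags, int dyn_flag, int static_flag)
{
	assert(dyn_flag != 0 && static_flag != 0);
	return is_static ? (flags & static_flag) != 0
			 : (flags & dyn_flag) != 0;
}

static bool flush_mfc_entry(bool is_static, int flags)
{
	return should_flush(is_static, flags, MRT_FLUSH_MFC, MRT_FLUSH_MFC_STATIC);
}

static bool flush_vif(bool is_static, int flags)
{
	return should_flush(is_static, flags, MRT_FLUSH_VIFS, MRT_FLUSH_VIFS_STATIC);
}

/* As of v5, the unresolved queue is cleared only with MRT_FLUSH_MFC. */
static bool flush_unresolved(int flags)
{
	return (flags & MRT_FLUSH_MFC) != 0;
}
```

For example, the flags passed on socket teardown (MRT_FLUSH_VIFS |
MRT_FLUSH_MFC) leave static vifs and static mfc entries in place, while the
table-free path passes all four flags and removes everything.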

Callum Sinclair (1):
  ipmr: ip6mr: Create new sockopt to clear mfc cache or vifs

 include/uapi/linux/mroute.h  |  9 ++++-
 include/uapi/linux/mroute6.h |  9 ++++-
 net/ipv4/ipmr.c              | 75 +++++++++++++++++++++-------------
 net/ipv6/ip6mr.c             | 78 +++++++++++++++++++++++-------------
 4 files changed, 115 insertions(+), 56 deletions(-)

-- 
2.20.1


* Re: r8169 Driver - Poor Network Performance Since Kernel 4.19
From: David Chang @ 2019-02-14  2:45 UTC (permalink / raw)
  To: Heiner Kallweit; +Cc: Realtek linux nic maintainers, netdev, Martti Laaksonen
In-Reply-To: <856b3a75-5daf-6ce8-7fa3-0405e3cefe97@gmail.com>

Hi Heiner,

On Feb 05, 2019 at 19:50:30 +0100, Heiner Kallweit wrote:
> Hi David,
> 
> meanwhile there's the following bug report matching what you reported.
> It's even the same chip version (RTL8168h).
> https://bugzilla.redhat.com/show_bug.cgi?id=1671958
> 
> Symptom there is also a significant number of rx_missed packets.
> Could you try what I mentioned there last:
> Try building a kernel with the call to rtl_hw_aspm_clkreq_enable(tp, true) at the
> end of rtl_hw_start_8168h_1() being disabled.

After disabling the ASPM call that you mentioned, we finally got a
positive test result, and the rx_missed errors are gone. Without
the patch, the receive side falls back to bad performance.

kernel: r8169: loading out-of-tree module taints kernel.
kernel: r8169: module verification failed: signature and/or required key missing - tainting kernel
kernel: libphy: r8169: probed
kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, ec:8e:b5:5a:2c:f5, XID 54100880, IRQ 128
kernel: r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
kernel: r8169 0000:01:00.0 enp1s0: renamed from eth0
kernel: Generic PHY r8169-100:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=r8169-100:00, irq=IGNORE)
kernel: r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control off

NIC statistics:
     tx_packets: 1653804
     rx_packets: 1555966
     tx_errors: 0
     rx_errors: 0
     rx_missed: 0
     align_errors: 0
     tx_single_collisions: 0
     tx_multi_collisions: 0
     unicast: 1555884
     broadcast: 78
     multicast: 4
     tx_aborted: 0
     tx_underrun: 0

iperf receive:
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------
Accepted connection from 10.x.x.x, port 55516
[  5] local 10.x.x.x port 5201 connected to 10.x.x.x port 58172
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   108 MBytes   906 Mbits/sec
[  5]   1.00-2.00   sec   112 MBytes   941 Mbits/sec
[  5]   2.00-3.00   sec   112 MBytes   940 Mbits/sec
[  5]   3.00-4.00   sec   112 MBytes   941 Mbits/sec
[  5]   4.00-5.00   sec   112 MBytes   941 Mbits/sec
[  5]   5.00-6.00   sec   112 MBytes   942 Mbits/sec
[  5]   6.00-7.00   sec   112 MBytes   939 Mbits/sec
[  5]   7.00-8.00   sec   112 MBytes   941 Mbits/sec
[  5]   8.00-9.00   sec   112 MBytes   938 Mbits/sec
[  5]   9.00-10.00  sec   112 MBytes   941 Mbits/sec
[  5]  10.00-11.00  sec   112 MBytes   941 Mbits/sec
[...]
[  5]  50.00-51.00  sec   112 MBytes   941 Mbits/sec
[  5]  51.00-52.00  sec   112 MBytes   941 Mbits/sec
[  5]  52.00-53.00  sec   112 MBytes   942 Mbits/sec
[  5]  53.00-54.00  sec   112 MBytes   941 Mbits/sec
[  5]  54.00-55.00  sec   111 MBytes   934 Mbits/sec
[  5]  55.00-56.00  sec   112 MBytes   942 Mbits/sec
[  5]  56.00-57.00  sec   112 MBytes   937 Mbits/sec
[  5]  57.00-58.00  sec   112 MBytes   941 Mbits/sec
[  5]  58.00-59.00  sec   111 MBytes   932 Mbits/sec
[  5]  59.00-60.00  sec   112 MBytes   942 Mbits/sec
[  5]  60.00-60.04  sec  4.06 MBytes   939 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-60.04  sec  6.57 GBytes   940 Mbits/sec                  receiver

regards,
David

> 
> Heiner
> 
> 
> On 31.01.2019 03:32, David Chang wrote:
> > Hi,
> > 
> > We had a similar case here.
> > - Realtek r8169 receive performance regression in kernel 4.19
> >   https://bugzilla.suse.com/show_bug.cgi?id=1119649
> > 
> > kernel: r8169 0000:01:00.0 eth0: RTL8168h/8111h, XID 54100880
> > The major symptom is a high rx_missed count.
> > 
> > 
> > On Jan 30, 2019 at 20:15:45 +0100, Heiner Kallweit wrote:
> >> Hi Peter,
> >>
> >> recently I had somebody where pcie_aspm=off for whatever reason didn't
> >> do the trick, can you also check with pcie_aspm.policy=performance.
> > 
> > We will give it a try later.
> > 
> >> And please check with "ethtool -S <if>" whether the chip statistics
> >> show a significant number of errors.
> >>
> >> If this doesn't help you may have to bisect to find the offending commit.
> > 
> > We tried falling back to the driver at a few previous commits, as follows,
> > but with no luck.
> > 
> > 9675931e6b65 r8169: re-enable MSI-X on RTL8168g (v4.19)
> > 098b01ad9837 r8169: don't include asm headers directly (v4.19-rc1)
> > a2965f12fde6 r8169: remove rtl8169_set_speed_xmii (v4.19-rc1)
> > 6fcf9b1d4d6c r8169: fix runtime suspend (v4.19-rc1)
> > e397286b8e89 r8169: remove TBI 1000BaseX support (v4.19-rc1)
> > 
> > Thanks,
> > David Chang
> > 
> >>
> >> Heiner
> >>
> >>
> >> On 30.01.2019 10:59, Peter Ceiley wrote:
> >>> Hi Heiner,
> >>>
> >>> I tried disabling the ASPM using the pcie_aspm=off kernel parameter
> >>> and this made no difference.
> >>>
> >>> I tried compiling the 4.18.16 r8169.c with the 4.19.18 source and
> >>> subsequently loaded the module in the running 4.19.18 kernel. I can
> >>> confirm that this immediately resolved the issue and access to the NFS
> >>> shares operated as expected.
> >>>
> >>> I presume this means it is an issue with the r8169 driver included in
> >>> 4.19 onwards?
> >>>
> >>> To answer your last questions:
> >>>
> >>> Base Board Information
> >>>     Manufacturer: Alienware
> >>>     Product Name: 0PGRP5
> >>>     Version: A02
> >>>
> >>> ... and yes, the RTL8168 is the onboard network chip.
> >>>
> >>> Regards,
> >>>
> >>> Peter.
> >>>
> >>> On Tue, 29 Jan 2019 at 17:44, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>
> >>>> Hi Peter,
> >>>>
> >>>> I think the vendor driver doesn't enable ASPM by default.
> >>>> So it's worth a try to disable ASPM in the BIOS or via sysfs.
> >>>> A few older systems seem to have issues with ASPM. What kind of
> >>>> system / mainboard are you using? Is the RTL8168 the onboard
> >>>> network chip?
> >>>>
> >>>> Rgds, Heiner
> >>>>
> >>>>
> >>>> On 29.01.2019 07:20, Peter Ceiley wrote:
> >>>>> Hi Heiner,
> >>>>>
> >>>>> Thanks, I'll do some more testing. It might not be the driver - I
> >>>>> assumed it was due to the fact that using the r8168 driver 'resolves'
> >>>>> the issue. I'll see if I can test the r8169.c on top of 4.19 - this is
> >>>>> a good idea.
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Peter.
> >>>>>
> >>>>> On Tue, 29 Jan 2019 at 17:16, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Peter,
> >>>>>>
> >>>>>> at a first glance it doesn't look like a typical driver issue.
> >>>>>> What you could do:
> >>>>>>
> >>>>>> - Test the r8169.c from 4.18 on top of 4.19.
> >>>>>>
> >>>>>> - Check whether disabling ASPM (/sys/module/pcie_aspm) has an effect.
> >>>>>>
> >>>>>> - Bisect between 4.18 and 4.19 to find the offending commit.
> >>>>>>
> >>>>>> Any specific reason why you think root cause is in the driver and not
> >>>>>> elsewhere in the network subsystem?
> >>>>>>
> >>>>>> Heiner
> >>>>>>
> >>>>>>
> >>>>>> On 28.01.2019 23:10, Peter Ceiley wrote:
> >>>>>>> Hi Heiner,
> >>>>>>>
> >>>>>>> Thanks for getting back to me.
> >>>>>>>
> >>>>>>> No, I don't use jumbo packets.
> >>>>>>>
> >>>>>>> Bandwidth is *generally* good, and iperf results to my NAS provide
> >>>>>>> over 900 Mbits/s in both circumstances. The issue seems to appear when
> >>>>>>> establishing a connection and is most notable, for example, on my
> >>>>>>> mounted NFS shares where it takes seconds (up to 10's of seconds on
> >>>>>>> larger directories) to list the contents of each directory. Once a
> >>>>>>> transfer begins on a file, I appear to get good bandwidth.
> >>>>>>>
> >>>>>>> I'm unsure of the best scientific data to provide you in order to
> >>>>>>> troubleshoot this issue. Running the following
> >>>>>>>
> >>>>>>>     netstat -s |grep retransmitted
> >>>>>>>
> >>>>>>> shows a steady increase in retransmitted segments each time I list the
> >>>>>>> contents of a remote directory, for example, running 'ls' on a
> >>>>>>> directory containing 345 media files did the following using kernel
> >>>>>>> 4.19.18:
> >>>>>>>
> >>>>>>> increased retransmitted segments by 21 and the 'time' command showed
> >>>>>>> the following:
> >>>>>>>     real    0m19.867s
> >>>>>>>     user    0m0.012s
> >>>>>>>     sys    0m0.036s
> >>>>>>>
> >>>>>>> The same command shows no retransmitted segments running kernel
> >>>>>>> 4.18.16 and 'time' showed:
> >>>>>>>     real    0m0.300s
> >>>>>>>     user    0m0.004s
> >>>>>>>     sys    0m0.007s
> >>>>>>>
> >>>>>>> ifconfig does not show any RX/TX errors nor dropped packets in either case.
> >>>>>>>
> >>>>>>> dmesg XID:
> >>>>>>> [    2.979984] r8169 0000:03:00.0 eth0: RTL8168g/8111g,
> >>>>>>> f8:b1:56:fe:67:e0, XID 4c000800, IRQ 32
> >>>>>>>
> >>>>>>> # lspci -vv
> >>>>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd.
> >>>>>>> RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
> >>>>>>>     Subsystem: Dell RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> >>>>>>>     Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
> >>>>>>> ParErr- Stepping- SERR- FastB2B- DisINTx+
> >>>>>>>     Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
> >>>>>>> <TAbort- <MAbort- >SERR- <PERR- INTx-
> >>>>>>>     Latency: 0, Cache Line Size: 64 bytes
> >>>>>>>     Interrupt: pin A routed to IRQ 19
> >>>>>>>     Region 0: I/O ports at d000 [size=256]
> >>>>>>>     Region 2: Memory at f7b00000 (64-bit, non-prefetchable) [size=4K]
> >>>>>>>     Region 4: Memory at f2100000 (64-bit, prefetchable) [size=16K]
> >>>>>>>     Capabilities: [40] Power Management version 3
> >>>>>>>         Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA
> >>>>>>> PME(D0+,D1+,D2+,D3hot+,D3cold+)
> >>>>>>>         Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> >>>>>>>     Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
> >>>>>>>         Address: 0000000000000000  Data: 0000
> >>>>>>>     Capabilities: [70] Express (v2) Endpoint, MSI 01
> >>>>>>>         DevCap:    MaxPayload 128 bytes, PhantFunc 0, Latency L0s
> >>>>>>> <512ns, L1 <64us
> >>>>>>>             ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> >>>>>>> SlotPowerLimit 10.000W
> >>>>>>>         DevCtl:    CorrErr- NonFatalErr- FatalErr- UnsupReq-
> >>>>>>>             RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
> >>>>>>>             MaxPayload 128 bytes, MaxReadReq 4096 bytes
> >>>>>>>         DevSta:    CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
> >>>>>>>         LnkCap:    Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit
> >>>>>>> Latency L0s unlimited, L1 <64us
> >>>>>>>             ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
> >>>>>>>         LnkCtl:    ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
> >>>>>>>             ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
> >>>>>>>         LnkSta:    Speed 2.5GT/s (ok), Width x1 (ok)
> >>>>>>>             TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> >>>>>>>         DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR+,
> >>>>>>> OBFF Via message/WAKE#
> >>>>>>>              AtomicOpsCap: 32bit- 64bit- 128bitCAS-
> >>>>>>>         DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+,
> >>>>>>> OBFF Disabled
> >>>>>>>              AtomicOpsCtl: ReqEn-
> >>>>>>>         LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
> >>>>>>>              Transmit Margin: Normal Operating Range,
> >>>>>>> EnterModifiedCompliance- ComplianceSOS-
> >>>>>>>              Compliance De-emphasis: -6dB
> >>>>>>>         LnkSta2: Current De-emphasis Level: -6dB,
> >>>>>>> EqualizationComplete-, EqualizationPhase1-
> >>>>>>>              EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
> >>>>>>>     Capabilities: [b0] MSI-X: Enable+ Count=4 Masked-
> >>>>>>>         Vector table: BAR=4 offset=00000000
> >>>>>>>         PBA: BAR=4 offset=00000800
> >>>>>>>     Capabilities: [d0] Vital Product Data
> >>>>>>> pcilib: sysfs_read_vpd: read failed: Input/output error
> >>>>>>>         Not readable
> >>>>>>>     Capabilities: [100 v1] Advanced Error Reporting
> >>>>>>>         UESta:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >>>>>>>         UEMsk:    DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
> >>>>>>>         UESvrt:    DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
> >>>>>>>         CESta:    RxErr+ BadTLP+ BadDLLP+ Rollover- Timeout+ AdvNonFatalErr-
> >>>>>>>         CEMsk:    RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> >>>>>>>         AERCap:    First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
> >>>>>>>             MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
> >>>>>>>         HeaderLog: 00000000 00000000 00000000 00000000
> >>>>>>>     Capabilities: [140 v1] Virtual Channel
> >>>>>>>         Caps:    LPEVC=0 RefClk=100ns PATEntryBits=1
> >>>>>>>         Arb:    Fixed- WRR32- WRR64- WRR128-
> >>>>>>>         Ctrl:    ArbSelect=Fixed
> >>>>>>>         Status:    InProgress-
> >>>>>>>         VC0:    Caps:    PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> >>>>>>>             Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
> >>>>>>>             Ctrl:    Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> >>>>>>>             Status:    NegoPending- InProgress-
> >>>>>>>     Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
> >>>>>>>     Capabilities: [170 v1] Latency Tolerance Reporting
> >>>>>>>         Max snoop latency: 71680ns
> >>>>>>>         Max no snoop latency: 71680ns
> >>>>>>>     Kernel driver in use: r8169
> >>>>>>>     Kernel modules: r8169
> >>>>>>>
> >>>>>>> Please let me know if you have any other ideas in terms of testing.
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Peter.
> >>>>>>>
> >>>>>>> On Tue, 29 Jan 2019 at 05:28, Heiner Kallweit <hkallweit1@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> On 28.01.2019 12:13, Peter Ceiley wrote:
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I have been experiencing very poor network performance since Kernel
> >>>>>>>>> 4.19 and I'm confident it's related to the r8169 driver.
> >>>>>>>>>
> >>>>>>>>> I have no issue with kernel versions 4.18 and prior. I am experiencing
> >>>>>>>>> this issue in kernels 4.19 and 4.20 (currently running/testing with
> >>>>>>>>> 4.20.4 & 4.19.18).
> >>>>>>>>>
> >>>>>>>>> If someone could guide me in the right direction, I'm happy to help
> >>>>>>>>> troubleshoot this issue. Note that I have been keeping an eye on one
> >>>>>>>>> issue related to loading of the PHY driver, however, my symptoms
> >>>>>>>>> differ in that I still have a network connection. I have attempted to
> >>>>>>>>> reload the driver on a running system, but this does not improve the
> >>>>>>>>> situation.
> >>>>>>>>>
> >>>>>>>>> Using the proprietary r8168 driver returns my device to proper working order.
> >>>>>>>>>
> >>>>>>>>> lshw shows:
> >>>>>>>>>        description: Ethernet interface
> >>>>>>>>>        product: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
> >>>>>>>>>        vendor: Realtek Semiconductor Co., Ltd.
> >>>>>>>>>        physical id: 0
> >>>>>>>>>        bus info: pci@0000:03:00.0
> >>>>>>>>>        logical name: enp3s0
> >>>>>>>>>        version: 0c
> >>>>>>>>>        serial:
> >>>>>>>>>        size: 1Gbit/s
> >>>>>>>>>        capacity: 1Gbit/s
> >>>>>>>>>        width: 64 bits
> >>>>>>>>>        clock: 33MHz
> >>>>>>>>>        capabilities: pm msi pciexpress msix vpd bus_master cap_list ethernet physical tp aui bnc mii fibre 10bt 10bt-fd 100bt 100bt-fd 1000bt-fd autonegotiation
> >>>>>>>>>        configuration: autonegotiation=on broadcast=yes driver=r8169 duplex=full firmware=rtl8168g-2_0.0.1 02/06/13 ip=192.168.1.25 latency=0 link=yes multicast=yes port=MII speed=1Gbit/s
> >>>>>>>>>        resources: irq:19 ioport:d000(size=256) memory:f7b00000-f7b00fff memory:f2100000-f2103fff
> >>>>>>>>>
> >>>>>>>>> Kind Regards,
> >>>>>>>>>
> >>>>>>>>> Peter.
> >>>>>>>>>
> >>>>>>>> Hi Peter,
> >>>>>>>>
> >>>>>>>> the description "poor network performance" is quite vague, therefore:
> >>>>>>>>
> >>>>>>>> - Can you provide any measurements?
> >>>>>>>> - iperf results before and after
> >>>>>>>> - statistics about dropped packets (rx and/or tx)
> >>>>>>>> - Do you use jumbo packets?
> >>>>>>>>
> >>>>>>>> Also helpful would be the "lspci -vv" output for the network card and
> >>>>>>>> the dmesg output line with the chip XID.
> >>>>>>>>
> >>>>>>>> Heiner

^ permalink raw reply

* Re: [PATCH bpf-next v11 0/7] bpf: add BPF_LWT_ENCAP_IP option to bpf_lwt_push_encap
From: David Ahern @ 2019-02-14  3:44 UTC (permalink / raw)
  To: Peter Oskolkov
  Cc: Alexei Starovoitov, Alexei Starovoitov, Daniel Borkmann, netdev,
	Peter Oskolkov, Willem de Bruijn
In-Reply-To: <20190214023916.fu6ymperb4lqi632@ast-mbp>

On 2/13/19 7:39 PM, Alexei Starovoitov wrote:
> On Wed, Feb 13, 2019 at 05:46:26PM -0700, David Ahern wrote:
>> On 2/13/19 12:53 PM, Peter Oskolkov wrote:
>>> This patchset implements BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
>>> BPF helper. It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN
>>> and BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
>>> to packets (e.g. IP/GRE, GUE, IPIP).
>>>
>>> This is useful when thousands of different short-lived flows should be
>>> encapped, each with different and dynamically determined destination.
>>> Although lwtunnels can be used in some of these scenarios, the ability
>>> to dynamically generate encap headers adds more flexibility, e.g.
>>> when routing depends on the state of the host (reflected in global bpf
>>> maps).
>>>
>>
>>
>> For the set:
>> Reviewed-by: David Ahern <dsahern@gmail.com>
> 
> Applied. Thanks everyone!
> 

Looks like a cleanup round is needed.

I changed the routes to fail with unreachable:

@@ -179,16 +175,16 @@
 	ip -netns ${NS3} tunnel add gre_dev mode gre remote ${IPv4_1} local ${IPv4_GRE} ttl 255
 	ip -netns ${NS3} link set gre_dev up
 	ip -netns ${NS3} addr add ${IPv4_GRE} dev gre_dev
-	ip -netns ${NS1} route add ${IPv4_GRE}/32 dev veth5 via ${IPv4_6}
-	ip -netns ${NS2} route add ${IPv4_GRE}/32 dev veth7 via ${IPv4_8}
+	ip -netns ${NS1} route add unreachable ${IPv4_GRE}/32
+	ip -netns ${NS2} route add unreachable ${IPv4_GRE}/32


 	# configure IPv6 GRE device in NS3, and a route to it via the "bottom" route
 	ip -netns ${NS3} -6 tunnel add name gre6_dev mode ip6gre remote ${IPv6_1} local ${IPv6_GRE} ttl 255
 	ip -netns ${NS3} link set gre6_dev up
 	ip -netns ${NS3} -6 addr add ${IPv6_GRE} nodad dev gre6_dev
-	ip -netns ${NS1} -6 route add ${IPv6_GRE}/128 dev veth5 via ${IPv6_6}
-	ip -netns ${NS2} -6 route add ${IPv6_GRE}/128 dev veth7 via ${IPv6_8}
+	ip -netns ${NS1} -6 route add unreachable ${IPv6_GRE}/128
+	ip -netns ${NS2} -6 route add unreachable ${IPv6_GRE}/128

 	# rp_filter gets confused by what these tests are doing, so disable it
 	ip netns exec ${NS1} sysctl -wq net.ipv4.conf.all.rp_filter=0
@@ -220,7 +216,6 @@


and then removed all of the 'set -e' and 'exit 1' statements in the script
(it really should let all of the tests run rather than bailing on the first
failure).
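
As a rough illustration of letting every test run, here is a minimal sketch
of a tally-based runner. The names run_test, TESTS_RUN and TESTS_FAILED are
hypothetical and are not taken from the actual selftest script; 'true' and
'false' stand in for real test commands.

```shell
#!/bin/bash
# Hypothetical sketch: record each result instead of exiting on failure.
# No 'set -e' here, so a failing test does not abort the run.

TESTS_RUN=0
TESTS_FAILED=0

# run_test NAME CMD [ARGS...]: run CMD, print PASS/FAIL, update counters.
run_test()
{
	local name="$1"
	shift

	TESTS_RUN=$((TESTS_RUN + 1))
	if "$@"; then
		echo "PASS: ${name}"
	else
		echo "FAIL: ${name}"
		TESTS_FAILED=$((TESTS_FAILED + 1))
	fi
}

# Placeholder commands; a real script would call its test functions here.
run_test "first test" true
run_test "failing test" false
run_test "still runs after a failure" true

echo "${TESTS_RUN} tests run, ${TESTS_FAILED} failed"
```

A real script would then finish with something like
'[ "${TESTS_FAILED}" -eq 0 ]' as its last command, so the exit status still
reports failure after all tests have been given a chance to run.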

With kmemleak enabled I see a lot of suspected memory leaks. Some may not
be related to this change, but it is what triggers the reports:


unreferenced object 0xffff88813407a9c0 (size 160):
  comm "ping", pid 1040, jiffies 4294800240 (age 130.536s)
  hex dump (first 32 bytes):
    00 60 23 28 81 88 ff ff 80 d7 23 82 ff ff ff ff  .`#(......#.....
    c1 7f c8 81 ff ff ff ff 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<00000000dec307f3>] dst_alloc+0x89/0xc1
    [<0000000037c7c09a>] rt_dst_alloc+0x57/0xd4
    [<00000000850d146d>] ip_route_output_key_hash_rcu+0x57a/0x64d
    [<0000000059f3e271>] ip_route_output_key_hash+0x6e/0x98
    [<0000000093465e72>] ip_route_output_flow+0x1e/0x47
    [<000000007eee78d9>] raw_sendmsg+0x551/0xbd8
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff8881280bdf00 (size 224):
  comm "ping", pid 1040, jiffies 4294800240 (age 130.536s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 80 96 27 81 88 ff ff 40 da a2 27 81 88 ff ff  ...'....@..'....
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<0000000043c55b9b>] __alloc_skb+0x66/0x1b9
    [<00000000e92d2e81>] __ip_append_data+0x44f/0xa88
    [<00000000c0fa4285>] ip_append_data.part.19+0xa4/0xb7
    [<000000008437d83b>] ip_append_data+0x22/0x28
    [<0000000010065ae2>] raw_sendmsg+0xaff/0xbd8
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff888127ad6c00 (size 1024):
  comm "ping", pid 1040, jiffies 4294800240 (age 130.545s)
  hex dump (first 32 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<000000006cd7857e>] __kmalloc_track_caller+0xfe/0x13f
    [<00000000ebec8a26>] __kmalloc_reserve.isra.17+0x2d/0x6d
    [<00000000827ebff0>] pskb_expand_head+0xcc/0x2d1
    [<0000000034bfc15f>] skb_cow_head+0xae/0xb7
    [<000000008f8c30fc>] bpf_lwt_push_ip_encap+0xb1/0x34c
    [<000000007add911b>] bpf_lwt_xmit_push_encap+0x1d/0x29
    [<00000000eecbf798>] ___bpf_prog_run+0xbc3/0x1757
    [<000000002d199add>] __bpf_prog_run32+0x42/0x58
    [<00000000d0c1f29b>] run_lwt_bpf.constprop.4+0xff/0x2e6
    [<00000000b9ad5b04>] bpf_xmit+0x3d/0xef
    [<000000003ba220bf>] lwtunnel_xmit+0xc7/0xeb
    [<00000000abfa977f>] ip_finish_output2+0x5b6/0x5e7
    [<000000005a652edf>] ip_finish_output+0x17f/0x191
    [<00000000cded5fe2>] ip_output+0x58/0x88
unreferenced object 0xffff888131020f00 (size 224):
  comm "ping", pid 1040, jiffies 4294800240 (age 130.545s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 80 96 27 81 88 ff ff 40 da a2 27 81 88 ff ff  ...'....@..'....
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<0000000043c55b9b>] __alloc_skb+0x66/0x1b9
    [<00000000e92d2e81>] __ip_append_data+0x44f/0xa88
    [<00000000c0fa4285>] ip_append_data.part.19+0xa4/0xb7
    [<000000008437d83b>] ip_append_data+0x22/0x28
    [<0000000010065ae2>] raw_sendmsg+0xaff/0xbd8
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff88812779cc00 (size 1024):
  comm "ping", pid 1040, jiffies 4294800240 (age 130.545s)
  hex dump (first 32 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<000000006cd7857e>] __kmalloc_track_caller+0xfe/0x13f
    [<00000000ebec8a26>] __kmalloc_reserve.isra.17+0x2d/0x6d
    [<00000000827ebff0>] pskb_expand_head+0xcc/0x2d1
    [<0000000034bfc15f>] skb_cow_head+0xae/0xb7
    [<000000008f8c30fc>] bpf_lwt_push_ip_encap+0xb1/0x34c
    [<000000007add911b>] bpf_lwt_xmit_push_encap+0x1d/0x29
    [<00000000eecbf798>] ___bpf_prog_run+0xbc3/0x1757
    [<000000002d199add>] __bpf_prog_run32+0x42/0x58
    [<00000000d0c1f29b>] run_lwt_bpf.constprop.4+0xff/0x2e6
    [<00000000b9ad5b04>] bpf_xmit+0x3d/0xef
    [<000000003ba220bf>] lwtunnel_xmit+0xc7/0xeb
    [<00000000abfa977f>] ip_finish_output2+0x5b6/0x5e7
    [<000000005a652edf>] ip_finish_output+0x17f/0x191
    [<00000000cded5fe2>] ip_output+0x58/0x88
unreferenced object 0xffff888131abf980 (size 1632):
  comm "ping6", pid 1041, jiffies 4294801264 (age 129.529s)
  hex dump (first 32 bytes):
    00 00 00 00 7f 00 00 06 00 00 00 00 00 00 3a 00  ..............:.
    0a 00 07 41 00 00 00 00 00 00 00 00 00 00 00 00  ...A............
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<000000006287ee1f>] sk_prot_alloc.isra.27+0x30/0xb4
    [<00000000f5b12125>] sk_alloc+0x2e/0x1aa
    [<00000000c2b12d1b>] inet6_create+0x1ae/0x3a7
    [<000000000a3125dc>] __sock_create+0x1c1/0x22a
    [<00000000039d3cb8>] sock_create+0x30/0x32
    [<000000000556b08a>] __sys_socket+0x3d/0xb3
    [<000000007e47d085>] __x64_sys_socket+0x1a/0x1e
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff888127683f00 (size 224):
  comm "softirq", pid 0, jiffies 4294801264 (age 129.529s)
  hex dump (first 32 bytes):
    00 60 23 28 81 88 ff ff 00 2f 30 35 81 88 ff ff  .`#(...../05....
    c1 7f c8 81 ff ff ff ff 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<00000000dec307f3>] dst_alloc+0x89/0xc1
    [<000000002651f911>] ip6_dst_alloc+0x25/0x63
    [<000000008608d224>] ip6_pol_route+0x201/0x2ae
    [<000000000084a9eb>] ip6_pol_route_output+0x19/0x1b
    [<000000005d385680>] fib6_rule_lookup+0xe7/0x12c
    [<000000001a93c416>] ip6_route_output_flags+0xc5/0xd1
    [<0000000023cbe9f3>] ip6_dst_lookup_tail+0x1a3/0x364
    [<000000003afeb57e>] ip6_dst_lookup_flow+0x47/0x9b
    [<00000000938dbf6b>] rawv6_sendmsg+0x45f/0xdfc
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
unreferenced object 0xffff88812fb34f00 (size 224):
  comm "ping6", pid 1041, jiffies 4294801264 (age 129.530s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 80 96 27 81 88 ff ff 80 f9 ab 31 81 88 ff ff  ...'.......1....
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<0000000043c55b9b>] __alloc_skb+0x66/0x1b9
    [<0000000014f706bf>] __ip6_append_data+0x57c/0xc62
    [<0000000016ddc7e9>] ip6_append_data+0x135/0x148
    [<00000000210a1bd5>] rawv6_sendmsg+0xb19/0xdfc
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff8881276bdc00 (size 1024):
  comm "ping6", pid 1041, jiffies 4294801264 (age 129.537s)
  hex dump (first 32 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<000000006cd7857e>] __kmalloc_track_caller+0xfe/0x13f
    [<00000000ebec8a26>] __kmalloc_reserve.isra.17+0x2d/0x6d
    [<00000000827ebff0>] pskb_expand_head+0xcc/0x2d1
    [<0000000034bfc15f>] skb_cow_head+0xae/0xb7
    [<000000008f8c30fc>] bpf_lwt_push_ip_encap+0xb1/0x34c
    [<000000007add911b>] bpf_lwt_xmit_push_encap+0x1d/0x29
    [<00000000eecbf798>] ___bpf_prog_run+0xbc3/0x1757
    [<000000002d199add>] __bpf_prog_run32+0x42/0x58
    [<00000000d0c1f29b>] run_lwt_bpf.constprop.4+0xff/0x2e6
    [<00000000b9ad5b04>] bpf_xmit+0x3d/0xef
    [<000000003ba220bf>] lwtunnel_xmit+0xc7/0xeb
    [<000000004c67bd98>] ip6_finish_output2+0x477/0x494
    [<00000000bf642298>] ip6_finish_output+0x106/0x110
    [<00000000365d3055>] ip6_output+0x87/0xbf
unreferenced object 0xffff8881278cbf00 (size 224):
  comm "ping6", pid 1041, jiffies 4294801264 (age 129.537s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 80 96 27 81 88 ff ff 80 f9 ab 31 81 88 ff ff  ...'.......1....
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<0000000043c55b9b>] __alloc_skb+0x66/0x1b9
    [<0000000014f706bf>] __ip6_append_data+0x57c/0xc62
    [<0000000016ddc7e9>] ip6_append_data+0x135/0x148
    [<00000000210a1bd5>] rawv6_sendmsg+0xb19/0xdfc
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff8881268fdc00 (size 1024):
  comm "ping6", pid 1041, jiffies 4294801264 (age 129.537s)
  hex dump (first 32 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<000000006cd7857e>] __kmalloc_track_caller+0xfe/0x13f
    [<00000000ebec8a26>] __kmalloc_reserve.isra.17+0x2d/0x6d
    [<00000000827ebff0>] pskb_expand_head+0xcc/0x2d1
    [<0000000034bfc15f>] skb_cow_head+0xae/0xb7
    [<000000008f8c30fc>] bpf_lwt_push_ip_encap+0xb1/0x34c
    [<000000007add911b>] bpf_lwt_xmit_push_encap+0x1d/0x29
    [<00000000eecbf798>] ___bpf_prog_run+0xbc3/0x1757
    [<000000002d199add>] __bpf_prog_run32+0x42/0x58
    [<00000000d0c1f29b>] run_lwt_bpf.constprop.4+0xff/0x2e6
    [<00000000b9ad5b04>] bpf_xmit+0x3d/0xef
    [<000000003ba220bf>] lwtunnel_xmit+0xc7/0xeb
    [<000000004c67bd98>] ip6_finish_output2+0x477/0x494
    [<00000000bf642298>] ip6_finish_output+0x106/0x110
    [<00000000365d3055>] ip6_output+0x87/0xbf
unreferenced object 0xffff888127157e80 (size 128):
  comm "ip", pid 1157, jiffies 4294810718 (age 120.100s)
  hex dump (first 32 bytes):
    06 00 04 00 00 00 00 00 02 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<00000000bd21c202>] __kmalloc+0x102/0x143
    [<00000000ac289f37>] lwtunnel_state_alloc+0x1a/0x1c
    [<000000002acf5ea0>] bpf_build_state+0x8c/0x16a
    [<00000000e1c85c0f>] lwtunnel_build_state+0x10a/0x148
    [<00000000187eb239>] ip6_route_info_create+0x2ae/0x823
    [<0000000017a89b3a>] ip6_route_add+0x1a/0x4e
    [<00000000de590240>] inet6_rtm_newroute+0x62/0x80
    [<000000009356b68b>] rtnetlink_rcv_msg+0x22d/0x273
    [<000000003da74356>] netlink_rcv_skb+0x8b/0xd9
    [<000000005349126a>] rtnetlink_rcv+0x15/0x17
    [<00000000d24a54ac>] netlink_unicast+0x118/0x1b1
    [<0000000096e0cc4e>] netlink_sendmsg+0x328/0x34d
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<000000003872075e>] ___sys_sendmsg+0x1ad/0x238
unreferenced object 0xffff8881271d1180 (size 64):
  comm "ip", pid 1157, jiffies 4294810718 (age 120.100s)
  hex dump (first 32 bytes):
    74 65 73 74 5f 6c 77 74 5f 69 70 5f 65 6e 63 61  test_lwt_ip_enca
    70 2e 6f 3a 5b 65 6e 63 61 70 5f 67 72 65 36 5d  p.o:[encap_gre6]
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<000000006cd7857e>] __kmalloc_track_caller+0xfe/0x13f
    [<000000004a28667f>] kmemdup+0x20/0x35
    [<00000000f0fe083f>] bpf_parse_prog+0x77/0xc3
    [<000000005e53fa11>] bpf_build_state+0x108/0x16a
    [<00000000e1c85c0f>] lwtunnel_build_state+0x10a/0x148
    [<00000000187eb239>] ip6_route_info_create+0x2ae/0x823
    [<0000000017a89b3a>] ip6_route_add+0x1a/0x4e
    [<00000000de590240>] inet6_rtm_newroute+0x62/0x80
    [<000000009356b68b>] rtnetlink_rcv_msg+0x22d/0x273
    [<000000003da74356>] netlink_rcv_skb+0x8b/0xd9
    [<000000005349126a>] rtnetlink_rcv+0x15/0x17
    [<00000000d24a54ac>] netlink_unicast+0x118/0x1b1
    [<0000000096e0cc4e>] netlink_sendmsg+0x328/0x34d
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
unreferenced object 0xffff88813189ea40 (size 1432):
  comm "ping", pid 1159, jiffies 4294810742 (age 120.076s)
  hex dump (first 32 bytes):
    00 00 00 00 ac 10 01 64 00 00 00 00 00 00 01 00  .......d........
    02 00 07 41 00 00 00 00 00 00 00 00 00 00 00 00  ...A............
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<000000006287ee1f>] sk_prot_alloc.isra.27+0x30/0xb4
    [<00000000f5b12125>] sk_alloc+0x2e/0x1aa
    [<0000000048f46e3a>] inet_create+0x1ab/0x32e
    [<000000000a3125dc>] __sock_create+0x1c1/0x22a
    [<00000000039d3cb8>] sock_create+0x30/0x32
    [<000000000556b08a>] __sys_socket+0x3d/0xb3
    [<000000007e47d085>] __x64_sys_socket+0x1a/0x1e
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff8881270f2f00 (size 224):
  comm "ping", pid 1159, jiffies 4294810743 (age 120.084s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 30 af 26 81 88 ff ff 40 ea 89 31 81 88 ff ff  .0.&....@..1....
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<0000000043c55b9b>] __alloc_skb+0x66/0x1b9
    [<00000000e92d2e81>] __ip_append_data+0x44f/0xa88
    [<00000000c0fa4285>] ip_append_data.part.19+0xa4/0xb7
    [<000000008437d83b>] ip_append_data+0x22/0x28
    [<0000000010065ae2>] raw_sendmsg+0xaff/0xbd8
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff8881336dac00 (size 1024):
  comm "ping", pid 1159, jiffies 4294810743 (age 120.084s)
  hex dump (first 32 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<000000006cd7857e>] __kmalloc_track_caller+0xfe/0x13f
    [<00000000ebec8a26>] __kmalloc_reserve.isra.17+0x2d/0x6d
    [<00000000827ebff0>] pskb_expand_head+0xcc/0x2d1
    [<0000000034bfc15f>] skb_cow_head+0xae/0xb7
    [<000000008f8c30fc>] bpf_lwt_push_ip_encap+0xb1/0x34c
    [<000000007add911b>] bpf_lwt_xmit_push_encap+0x1d/0x29
    [<00000000eecbf798>] ___bpf_prog_run+0xbc3/0x1757
    [<0000000055e881cc>] __bpf_prog_run64+0x42/0x58
    [<00000000d0c1f29b>] run_lwt_bpf.constprop.4+0xff/0x2e6
    [<00000000b9ad5b04>] bpf_xmit+0x3d/0xef
    [<000000003ba220bf>] lwtunnel_xmit+0xc7/0xeb
    [<00000000abfa977f>] ip_finish_output2+0x5b6/0x5e7
    [<000000005a652edf>] ip_finish_output+0x17f/0x191
    [<00000000cded5fe2>] ip_output+0x58/0x88
unreferenced object 0xffff8881283d0f00 (size 224):
  comm "ping", pid 1159, jiffies 4294810743 (age 120.084s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 30 af 26 81 88 ff ff 40 ea 89 31 81 88 ff ff  .0.&....@..1....
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<0000000043c55b9b>] __alloc_skb+0x66/0x1b9
    [<00000000e92d2e81>] __ip_append_data+0x44f/0xa88
    [<00000000c0fa4285>] ip_append_data.part.19+0xa4/0xb7
    [<000000008437d83b>] ip_append_data+0x22/0x28
    [<0000000010065ae2>] raw_sendmsg+0xaff/0xbd8
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff8881272edc00 (size 1024):
  comm "ping", pid 1159, jiffies 4294810743 (age 120.093s)
  hex dump (first 32 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<000000006cd7857e>] __kmalloc_track_caller+0xfe/0x13f
    [<00000000ebec8a26>] __kmalloc_reserve.isra.17+0x2d/0x6d
    [<00000000827ebff0>] pskb_expand_head+0xcc/0x2d1
    [<0000000034bfc15f>] skb_cow_head+0xae/0xb7
    [<000000008f8c30fc>] bpf_lwt_push_ip_encap+0xb1/0x34c
    [<000000007add911b>] bpf_lwt_xmit_push_encap+0x1d/0x29
    [<00000000eecbf798>] ___bpf_prog_run+0xbc3/0x1757
    [<0000000055e881cc>] __bpf_prog_run64+0x42/0x58
    [<00000000d0c1f29b>] run_lwt_bpf.constprop.4+0xff/0x2e6
    [<00000000b9ad5b04>] bpf_xmit+0x3d/0xef
    [<000000003ba220bf>] lwtunnel_xmit+0xc7/0xeb
    [<00000000abfa977f>] ip_finish_output2+0x5b6/0x5e7
    [<000000005a652edf>] ip_finish_output+0x17f/0x191
    [<00000000cded5fe2>] ip_output+0x58/0x88
unreferenced object 0xffff88813355d980 (size 1632):
  comm "ping6", pid 1160, jiffies 4294811768 (age 119.068s)
  hex dump (first 32 bytes):
    00 00 00 00 7f 00 00 06 00 00 00 00 00 00 3a 00  ..............:.
    0a 00 07 41 00 00 00 00 00 00 00 00 00 00 00 00  ...A............
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<000000006287ee1f>] sk_prot_alloc.isra.27+0x30/0xb4
    [<00000000f5b12125>] sk_alloc+0x2e/0x1aa
    [<00000000c2b12d1b>] inet6_create+0x1ae/0x3a7
    [<000000000a3125dc>] __sock_create+0x1c1/0x22a
    [<00000000039d3cb8>] sock_create+0x30/0x32
    [<000000000556b08a>] __sys_socket+0x3d/0xb3
    [<000000007e47d085>] __x64_sys_socket+0x1a/0x1e
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff8881282b1f00 (size 224):
  comm "softirq", pid 0, jiffies 4294811768 (age 119.068s)
  hex dump (first 32 bytes):
    00 10 1c 28 81 88 ff ff 40 ee 25 28 81 88 ff ff  ...(....@.%(....
    c1 7f c8 81 ff ff ff ff 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<00000000dec307f3>] dst_alloc+0x89/0xc1
    [<000000002651f911>] ip6_dst_alloc+0x25/0x63
    [<000000008608d224>] ip6_pol_route+0x201/0x2ae
    [<000000000084a9eb>] ip6_pol_route_output+0x19/0x1b
    [<000000005d385680>] fib6_rule_lookup+0xe7/0x12c
    [<000000001a93c416>] ip6_route_output_flags+0xc5/0xd1
    [<0000000023cbe9f3>] ip6_dst_lookup_tail+0x1a3/0x364
    [<000000003afeb57e>] ip6_dst_lookup_flow+0x47/0x9b
    [<00000000938dbf6b>] rawv6_sendmsg+0x45f/0xdfc
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
unreferenced object 0xffff88812744bf00 (size 224):
  comm "ping6", pid 1160, jiffies 4294811768 (age 119.076s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 30 af 26 81 88 ff ff 80 d9 55 33 81 88 ff ff  .0.&......U3....
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<0000000043c55b9b>] __alloc_skb+0x66/0x1b9
    [<0000000014f706bf>] __ip6_append_data+0x57c/0xc62
    [<0000000016ddc7e9>] ip6_append_data+0x135/0x148
    [<00000000210a1bd5>] rawv6_sendmsg+0xb19/0xdfc
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff888127371c00 (size 1024):
  comm "ping6", pid 1160, jiffies 4294811769 (age 119.075s)
  hex dump (first 32 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<000000006cd7857e>] __kmalloc_track_caller+0xfe/0x13f
    [<00000000ebec8a26>] __kmalloc_reserve.isra.17+0x2d/0x6d
    [<00000000827ebff0>] pskb_expand_head+0xcc/0x2d1
    [<0000000034bfc15f>] skb_cow_head+0xae/0xb7
    [<000000008f8c30fc>] bpf_lwt_push_ip_encap+0xb1/0x34c
    [<000000007add911b>] bpf_lwt_xmit_push_encap+0x1d/0x29
    [<00000000eecbf798>] ___bpf_prog_run+0xbc3/0x1757
    [<0000000055e881cc>] __bpf_prog_run64+0x42/0x58
    [<00000000d0c1f29b>] run_lwt_bpf.constprop.4+0xff/0x2e6
    [<00000000b9ad5b04>] bpf_xmit+0x3d/0xef
    [<000000003ba220bf>] lwtunnel_xmit+0xc7/0xeb
    [<000000004c67bd98>] ip6_finish_output2+0x477/0x494
    [<00000000bf642298>] ip6_finish_output+0x106/0x110
    [<00000000365d3055>] ip6_output+0x87/0xbf
unreferenced object 0xffff88812723cf00 (size 224):
  comm "softirq", pid 0, jiffies 4294811770 (age 119.074s)
  hex dump (first 32 bytes):
    00 10 1c 28 81 88 ff ff 40 ee 25 28 81 88 ff ff  ...(....@.%(....
    c1 7f c8 81 ff ff ff ff 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<00000000dec307f3>] dst_alloc+0x89/0xc1
    [<000000002651f911>] ip6_dst_alloc+0x25/0x63
    [<000000008608d224>] ip6_pol_route+0x201/0x2ae
    [<000000000084a9eb>] ip6_pol_route_output+0x19/0x1b
    [<000000005d385680>] fib6_rule_lookup+0xe7/0x12c
    [<000000001a93c416>] ip6_route_output_flags+0xc5/0xd1
    [<0000000023cbe9f3>] ip6_dst_lookup_tail+0x1a3/0x364
    [<000000003afeb57e>] ip6_dst_lookup_flow+0x47/0x9b
    [<00000000938dbf6b>] rawv6_sendmsg+0x45f/0xdfc
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
unreferenced object 0xffff8881273d3f00 (size 224):
  comm "ping6", pid 1160, jiffies 4294811770 (age 119.084s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 30 af 26 81 88 ff ff 80 d9 55 33 81 88 ff ff  .0.&......U3....
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<0000000025231f88>] kmem_cache_alloc+0xd8/0x1fa
    [<0000000043c55b9b>] __alloc_skb+0x66/0x1b9
    [<0000000014f706bf>] __ip6_append_data+0x57c/0xc62
    [<0000000016ddc7e9>] ip6_append_data+0x135/0x148
    [<00000000210a1bd5>] rawv6_sendmsg+0xb19/0xdfc
    [<00000000f564ad0b>] inet_sendmsg+0x3f/0x82
    [<00000000a0a71539>] sock_sendmsg_nosec+0x18/0x2f
    [<0000000025dbe598>] __sys_sendto+0x102/0x143
    [<000000000f989e54>] __x64_sys_sendto+0x28/0x2c
    [<00000000520e974d>] do_syscall_64+0x5c/0x6e
    [<00000000413f2b33>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000005a49f8d9>] 0xffffffffffffffff
unreferenced object 0xffff88812825bc00 (size 1024):
  comm "ping6", pid 1160, jiffies 4294811771 (age 119.083s)
  hex dump (first 32 bytes):
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
    5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
  backtrace:
    [<00000000c4c19340>] kmemleak_alloc+0x70/0x94
    [<000000003251fc6a>] slab_post_alloc_hook+0x47/0x5c
    [<000000006cd7857e>] __kmalloc_track_caller+0xfe/0x13f
    [<00000000ebec8a26>] __kmalloc_reserve.isra.17+0x2d/0x6d
    [<00000000827ebff0>] pskb_expand_head+0xcc/0x2d1
    [<0000000034bfc15f>] skb_cow_head+0xae/0xb7
    [<000000008f8c30fc>] bpf_lwt_push_ip_encap+0xb1/0x34c
    [<000000007add911b>] bpf_lwt_xmit_push_encap+0x1d/0x29
    [<00000000eecbf798>] ___bpf_prog_run+0xbc3/0x1757
    [<0000000055e881cc>] __bpf_prog_run64+0x42/0x58
    [<00000000d0c1f29b>] run_lwt_bpf.constprop.4+0xff/0x2e6
    [<00000000b9ad5b04>] bpf_xmit+0x3d/0xef
    [<000000003ba220bf>] lwtunnel_xmit+0xc7/0xeb
    [<000000004c67bd98>] ip6_finish_output2+0x477/0x494
    [<00000000bf642298>] ip6_finish_output+0x106/0x110
    [<00000000365d3055>] ip6_output+0x87/0xbf

^ permalink raw reply

* [PATCH net-next 0/2] net: phy: add new led configuration for marvell m88e1510
From: Jian Shen @ 2019-02-14  4:31 UTC (permalink / raw)
  To: andrew, f.fainelli, hkallweit1, davem; +Cc: netdev, linux-kernel, linuxarm

Currently, the m88e1510 PHY driver uses LED[0] and LED[1] for link and
LED[2] for activity. This is incompatible with some boards, which use
LED[1] for activity. This patchset adds a new LED configuration for the
HNS3 driver.

Jian Shen (2):
  net: phy: marvell: add new m88e1510 LED configuration
  net: hns3: add fixup handle for hns3 driver

 .../ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c    | 19 +++++++++++++++++++
 drivers/net/phy/marvell.c                          | 22 +++++++++++++++++++++-
 include/linux/marvell_phy.h                        |  1 +
 3 files changed, 41 insertions(+), 1 deletion(-)

-- 
1.9.1


^ permalink raw reply

* [PATCH net-next 1/2] net: phy: marvell: add new m88e1510 LED configuration
From: Jian Shen @ 2019-02-14  4:31 UTC (permalink / raw)
  To: andrew, f.fainelli, hkallweit1, davem; +Cc: netdev, linux-kernel, linuxarm
In-Reply-To: <1550118667-119947-1-git-send-email-shenjian15@huawei.com>

The default m88e1510 LED configuration is 0x1177, which uses LED[0]
for 1000M link, LED[1] for 100M link, and LED[2] for activity.
Our boards need 0x1040 instead, which uses LED[0] for link and
LED[1] for activity.

This patch adds a new m88e1510 LED configuration for it.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
---
 drivers/net/phy/marvell.c   | 22 +++++++++++++++++++++-
 include/linux/marvell_phy.h |  1 +
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index 3ccba37..c195286 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -128,6 +128,10 @@
 #define MII_PHY_LED_CTRL	        16
 #define MII_88E1121_PHY_LED_DEF		0x0030
 #define MII_88E1510_PHY_LED_DEF		0x1177
+#define MII_88E1510_PHY_HNS3_LED_DEF	0x1040
+
+#define MII_88E1510_PHY_LED_POLARITY_CTRL	0x11
+#define MII_88E1510_PHY_HNS3_LED_POLARITY	0x4415
 
 #define MII_M1011_PHY_STATUS		0x11
 #define MII_M1011_PHY_STATUS_1000	0x8000
@@ -619,12 +623,19 @@ static void marvell_config_led(struct phy_device *phydev)
 		def_config = MII_88E1121_PHY_LED_DEF;
 		break;
 	/* Default PHY LED config:
+	 * For hns3:
+	 * LED[0] .. Link
+	 * LED[1] .. Activity
+	 * For others:
 	 * LED[0] .. 1000Mbps Link
 	 * LED[1] .. 100Mbps Link
 	 * LED[2] .. Blink, Activity
 	 */
 	case MARVELL_PHY_FAMILY_ID(MARVELL_PHY_ID_88E1510):
-		def_config = MII_88E1510_PHY_LED_DEF;
+		if (phydev->dev_flags & MARVELL_PHY_M1510_HNS3_LEDS)
+			def_config = MII_88E1510_PHY_HNS3_LED_DEF;
+		else
+			def_config = MII_88E1510_PHY_LED_DEF;
 		break;
 	default:
 		return;
@@ -634,6 +645,15 @@ static void marvell_config_led(struct phy_device *phydev)
 			      def_config);
 	if (err < 0)
 		phydev_warn(phydev, "Fail to config marvell phy LED.\n");
+
+	if (phydev->dev_flags & MARVELL_PHY_M1510_HNS3_LEDS) {
+		err = phy_write_paged(phydev, MII_MARVELL_LED_PAGE,
+				      MII_88E1510_PHY_LED_POLARITY_CTRL,
+				      MII_88E1510_PHY_HNS3_LED_POLARITY);
+		if (err < 0)
+			phydev_warn(phydev,
+				    "Fail to config marvell phy LED polarity.\n");
+	}
 }
 
 static int marvell_config_init(struct phy_device *phydev)
diff --git a/include/linux/marvell_phy.h b/include/linux/marvell_phy.h
index 1eb6f24..99e0bbb 100644
--- a/include/linux/marvell_phy.h
+++ b/include/linux/marvell_phy.h
@@ -32,5 +32,6 @@
 /* struct phy_device dev_flags definitions */
 #define MARVELL_PHY_M1145_FLAGS_RESISTANCE	0x00000001
 #define MARVELL_PHY_M1118_DNS323_LEDS		0x00000002
+#define MARVELL_PHY_M1510_HNS3_LEDS		0x00000004
 
 #endif /* _MARVELL_PHY_H */
-- 
1.9.1


^ permalink raw reply related

* [PATCH net-next 2/2] net: hns3: add fixup handle for hns3 driver
From: Jian Shen @ 2019-02-14  4:31 UTC (permalink / raw)
  To: andrew, f.fainelli, hkallweit1, davem; +Cc: netdev, linux-kernel, linuxarm
In-Reply-To: <1550118667-119947-1-git-send-email-shenjian15@huawei.com>

The default LED configuration of the Marvell 88E1510 does not suit
the hns3 driver; this patch fixes it.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
---
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c   | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
index 84f2878..4c8346e 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
@@ -2,6 +2,7 @@
 // Copyright (c) 2016-2017 Hisilicon Limited.
 
 #include <linux/etherdevice.h>
+#include <linux/marvell_phy.h>
 #include <linux/kernel.h>
 
 #include "hclge_cmd.h"
@@ -125,6 +126,13 @@ static int hclge_mdio_read(struct mii_bus *bus, int phyid, int regnum)
 	return le16_to_cpu(mdio_cmd->data_rd);
 }
 
+static int hclge_phy_marvell_fixup(struct phy_device *phydev)
+{
+	phydev->dev_flags |= MARVELL_PHY_M1510_HNS3_LEDS;
+
+	return 0;
+}
+
 int hclge_mac_mdio_config(struct hclge_dev *hdev)
 {
 	struct hclge_mac *mac = &hdev->hw.mac;
@@ -168,6 +176,15 @@ int hclge_mac_mdio_config(struct hclge_dev *hdev)
 	mac->phydev = phydev;
 	mac->mdio_bus = mdio_bus;
 
+	/* register the PHY board fixup (for Marvell 88E1510) */
+	ret = phy_register_fixup_for_uid(MARVELL_PHY_ID_88E1510,
+					 MARVELL_PHY_ID_MASK,
+					 hclge_phy_marvell_fixup);
+	/* we can live without it, so just issue a warning */
+	if (ret)
+		dev_warn(&hdev->pdev->dev,
+			 "Cannot register PHY board fixup\n");
+
 	return 0;
 }
 
@@ -240,6 +257,8 @@ void hclge_mac_disconnect_phy(struct hnae3_handle *handle)
 	if (!phydev)
 		return;
 
+	phy_unregister_fixup_for_uid(MARVELL_PHY_ID_88E1510,
+				     MARVELL_PHY_ID_MASK);
 	phy_disconnect(phydev);
 }
 
-- 
1.9.1


^ permalink raw reply related

* Re: [PATCH 2/2] doc: add phylink documentation to the networking book
From: Randy Dunlap @ 2019-02-14  4:00 UTC (permalink / raw)
  To: Russell King, linux-doc, netdev; +Cc: David S. Miller, Jonathan Corbet
In-Reply-To: <E1gr376-0007ea-NV@rmk-PC.armlinux.org.uk>

On 2/5/19 7:58 AM, Russell King wrote:
> Add some phylink documentation to the networking book detailing how
> to convert network drivers from phylib to phylink.
> 
> Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
> ---
> Version 2 adds the "Modes of operation" section, as it appears mvpp2 is
> non-conformant (which is, unfortunately, causing problems in certain
> circumstances.)
> 
>  Documentation/networking/index.rst       |   1 +
>  Documentation/networking/sfp-phylink.rst | 268 +++++++++++++++++++++++++++++++
>  2 files changed, 269 insertions(+)
>  create mode 100644 Documentation/networking/sfp-phylink.rst
> 

> diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst
> new file mode 100644
> index 000000000000..78a577c9d8a3
> --- /dev/null
> +++ b/Documentation/networking/sfp-phylink.rst
> @@ -0,0 +1,268 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=======
> +phylink
> +=======
> +
> +Overview
> +========
> +
> +phylink is a mechanism to support hot-pluggable networking modules
> +without needing to re-initialise the adapter on hot-plug events.
> +
> +phylink supports conventional phylib-based setups, fixed link setups
> +and SFP modules at present.

Please say what SFP stands for.
It would also be nice if net/phy/Kconfig said what SFP stands for.

> +
> +Modes of operation
> +==================
> +
> +phylink has several modes of operation, which depend on the firmware
> +settings.
> +
> +1. PHY mode
> +
> +   In PHY mode, we use phylib to read the current link settings from
> +   the PHY, and pass them to the MAC driver.  We expect the MAC driver
> +   to configure exactly the modes that are specified without any
> +   negotiation being enabled on the link.
> +
> +2. Fixed mode
> +
> +   Fixed mode is the same as PHY mode as far as the MAC driver is
> +   concerned.
> +
> +3. In-band mode
> +
> +   In-band mode is used with 802.3z, SGMII and similar interface modes

should "with" be "when"?

> +   are used, and we are expecting to use the and honor the in-band

eh?                                  ^^^^^^^^^^^^^^^^^^^^

> +   negotiation or control word sent across the serdes channel.
> +
> +By example, what this means is that:
> +
> +.. code-block:: none
> +
> +  &eth {
> +    phy = <&phy>;
> +    phy-mode = "sgmii";
> +  };
> +
> +does not use in-band SGMII signalling.  The PHY is expected to follow
> +exactly the settings given to it in its :c:func:`mac_config` function.
> +The link should be forced up or down appropriately in the
> +:c:func:`mac_link_up` and :c:func:`mac_link_down` functions.
> +
> +.. code-block:: none
> +
> +  &eth {
> +    managed = "in-band-status";
> +    phy = <&phy>;
> +    phy-mode = "sgmii";
> +  };
> +
> +uses in-band mode, where results from the PHYs negotiation are passed

                                             PHY's

> +to the MAC through the SGMII control word, and the MAC is expected to
> +acknowledge the control word.  The :c:func:`mac_link_up` and
> +:c:func:`mac_link_down` functions must not force the MAC side link
> +up and down.
> +
> +Rough guide to converting a network driver to sfp/phylink
> +=========================================================
> +
> +This guide briefly describes how to convert a network driver from
> +phylib to the sfp/phylink support.  Please send patches to improve
> +this documentation.
> +
> +1. Optionally split the network driver's phylib update function into
> +   three parts dealing with link-down, link-up and reconfiguring the
> +   MAC settings. This can be done as a separate preparation commit.
> +
> +   An example of this preparation can be found in git commit fc548b991fb0.
> +
> +2. Replace::
> +
> +	select FIXED_PHY
> +	select PHYLIB
> +
> +   with::
> +
> +	select PHYLINK
> +
> +   in the driver's Kconfig stanza.
> +
> +3. Add::
> +
> +	#include <linux/phylink.h>
> +
> +   to the driver's list of header files.
> +
> +4. Add::
> +
> +	struct phylink *phylink;
> +
> +   to the driver's private data structure.  We shall refer to the
> +   driver's private data pointer as ``priv`` below, and the driver's
> +   private data structure as ``struct foo_priv``.
> +
> +5. Replace the following functions:
> +
> +   .. flat-table::
> +    :header-rows: 1
> +    :widths: 1 1
> +    :stub-columns: 0
> +
> +    * - Original function
> +      - Replacement function
> +    * - phy_start(phydev)
> +      - phylink_start(priv->phylink)
> +    * - phy_stop(phydev)
> +      - phylink_stop(priv->phylink)
> +    * - phy_mii_ioctl(phydev, ifr, cmd)
> +      - phylink_mii_ioctl(priv->phylink, ifr, cmd)
> +    * - phy_ethtool_get_wol(phydev, wol)
> +      - phylink_ethtool_get_wol(priv->phylink, wol)
> +    * - phy_ethtool_set_wol(phydev, wol)
> +      - phylink_ethtool_set_wol(priv->phylink, wol)
> +    * - phy_disconnect(phydev)
> +      - phylink_disconnect_phy(priv->phylink)
> +
> +   Please note that some of these functions must be called under the
> +   rtnl lock, and will warn if not. This will normally be the case,
> +   except if these are called from the driver suspend/resume paths.
> +
> +6. Add/replace ksettings get/set methods with:
> +
> +   .. code-block:: c
> +
> +    static int foo_ethtool_set_link_ksettings(struct net_device *dev,
> +					     const struct ethtool_link_ksettings *cmd)
> +    {
> +	struct foo_priv *priv = netdev_priv(dev);
> +
> +	return phylink_ethtool_ksettings_set(priv->phylink, cmd);
> +    }
> +
> +    static int foo_ethtool_get_link_ksettings(struct net_device *dev,
> +					     struct ethtool_link_ksettings *cmd)
> +    {
> +	struct foo_priv *priv = netdev_priv(dev);
> +
> +	return phylink_ethtool_ksettings_get(priv->phylink, cmd);
> +    }
> +
> +7. Replace the call to:
> +
> +	phy_dev = of_phy_connect(dev, node, link_func, flags, phy_interface)

add ending ';' above.

> +
> +   and associated code with a call to:
> +
> +	err = phylink_of_phy_connect(priv->phylink, node, flags)

ditto.

> +
> +   For the most part, ``flags`` can be zero, these flags are passed to

                                          zero;

> +   the of_phy_attach() inside this function call if a PHY is specified
> +   in the DT node ``node``.
> +
> +   ``node`` should be the DT node which contains the network phy property,
> +   fixed link properties, and will also contain the sfp property.
> +
> +   The setup of fixed links should also be removed; these are handled
> +   natively by phylink.

      internally?

> +
> +   of_phy_connect() was also passed a function pointer for link updates.
> +   This function is replaced by a different form of MAC updates
> +   described below in (8).
> +
> +   Manipulation of the PHY's supported/advertised happens within phylink

                          PHYs

> +   based on the validate callback, see below in (8).
> +
> +   Note that the driver no longer needs to store the ``phy_interface``,
> +   and also note that ``phy_interface`` becomes a dynamic property,
> +   just like the speed, duplex etc settings.

                                  etc.

> +
> +   Finally, note that the MAC driver has no direct access to the PHY
> +   anymore; that is because in the phylink model, the PHY can be
> +   dynamic.
> +
> +8. Add a :c:type:`struct phylink_mac_ops <phylink_mac_ops>` instance to
> +   the driver, which is a table of function pointers, and implement
> +   these functions. The old link update function for
> +   :c:func:`of_phy_connect` becomes three methods: :c:func:`mac_link_up`,
> +   :c:func:`mac_link_down`, and :c:func:`mac_config`. If step 1 was
> +   performed, then the functionality will have been split there.
> +
> +   It is important that if in-band negotiation is used,
> +   :c:func:`mac_link_up` and :c:func:`mac_link_down` do not prevent the
> +   in-band negotiation from completing, since these functions are called
> +   when the in-band link state changes - otherwise the link will never
> +   come up.
> +
> +   The :c:func:`validate` method should mask the supplied supported mask,
> +   and ``state->advertising`` with the supported ethtool link modes.
> +   These are the new ethtool link modes, so bitmask operations must be
> +   used. For an example, see drivers/net/ethernet/marvell/mvneta.c.
> +
> +   The :c:func:`mac_link_state` method is used to read the link state
> +   from the MAC, and report back the settings that the MAC is currently
> +   using. This is particularly important for in-band negotiation
> +   methods such as 1000base-X and SGMII.
> +
> +   The :c:func:`mac_config` method is used to update the MAC with the
> +   requested state, and must avoid unnecessarily taking the link down
> +   when making changes to the MAC configuration.  This means the
> +   function should modify the state and only take the link down when
> +   absolutely necessary to change the MAC configuration.  An example
> +   of how to do this can be found in :c:func:`mvneta_mac_config` in
> +   drivers/net/ethernet/marvell/mvneta.c.
> +
> +   For further information on these methods, please see the inline
> +   documentation in :c:type:`struct phylink_mac_ops <phylink_mac_ops>`.
> +
> +9. Remove calls to of_parse_phandle() for the PHY,
> +   of_phy_register_fixed_link() for fixed links etc from the probe

                                                   etc.

> +   function, and replace with:
> +
> +   .. code-block:: c
> +
> +	struct phylink *phylink;
> +
> +	phylink = phylink_create(dev, node, phy_mode, &phylink_ops);
> +	if (IS_ERR(phylink)) {
> +		err = PTR_ERR(phylink);
> +		fail probe;
> +	}
> +
> +	priv->phylink = phylink;
> +
> +   and arrange to destroy the phylink in the probe failure path as
> +   appropriate and the removal path too by calling:
> +
> +   .. code-block:: c
> +
> +	phylink_destroy(priv->phylink);
> +
> +10. Arrange for MAC link state interrupts to be forwarded into
> +    phylink, via:
> +
> +    .. code-block:: c
> +
> +	phylink_mac_change(priv->phylink, link_is_up);
> +
> +    where ``link_is_up`` is true if the link is currently up or false
> +    otherwise.
> +
> +11. Verify that the driver does not call::
> +
> +	netif_carrier_on()
> +	netif_carrier_off()
> +
> +   as these will interfere with phylink's tracking of the link state,
> +   and cause phylink to omit calls via the :c:func:`mac_link_up` and
> +   :c:func:`mac_link_down` methods.
> +
> +Network drivers should call phylink_stop() and phylink_start() via their
> +suspend/resume paths, which ensures that the appropriate
> +:c:type:`struct phylink_mac_ops <phylink_mac_ops>` methods are called
> +as necessary.
> +
> +For information describing the SFP cage in DT, please see the binding
> +documentation in the kernel source tree
> +``Documentation/devicetree/bindings/net/sff,sfp.txt``

Oh, so SFP means "Small Form-factor Pluggable".

I see that this source file:
./drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:1902:

seems to imply that SFP means "single function per port (SFP) mode":

	dev_err(&pf->pdev->dev,
		"VF %d requested polling mode: this feature is supported only when the device is running in single function per port (SFP) mode\n",
		 vf->vf_id);


Good job overall.  Thanks.

-- 
~Randy

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: phy: marvell: add new m88e1510 LED configuration
From: Florian Fainelli @ 2019-02-14  4:06 UTC (permalink / raw)
  To: Jian Shen, andrew, hkallweit1, davem; +Cc: netdev, linux-kernel, linuxarm
In-Reply-To: <1550118667-119947-2-git-send-email-shenjian15@huawei.com>



On 2/13/2019 8:31 PM, Jian Shen wrote:
> The default m88e1510 LED configuration is 0x1177, which uses LED[0]
> for 1000M link, LED[1] for 100M link, and LED[2] for activity.
> Our boards need 0x1040 instead, which uses LED[0] for link and
> LED[1] for activity.
> 
> This patch adds a new m88e1510 LED configuration for it.

There appears to be a precedent with the DNS323 flag that was defined
for the same purpose, but this unfortunately does not scale: we cannot
have every new platform come up with its own LED configuration without
a more structured approach to representing the LED configuration.

Maybe we can encode the desired LED behavior in a more generic way and
utilize the 32 flag bits available to denote a selection, e.g.:

MARVELL_PHY_FLAG_LED0_100M	BIT(3)
MARVELL_PHY_FLAG_LED0_1000M	BIT(4)

etc.

Or, maybe even better, expose the LEDs through the standard LED class
subsystem and allow configuring different triggers. We already have some
support for PHY LEDs in tree, but AFAIR what we lack is a "hardware
blinking" trigger, which is what these LEDs are.
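
To make the flag-bit idea above a little more concrete, here is a rough
userspace sketch, not kernel code. Only MARVELL_PHY_FLAG_LED0_1000M and its
BIT(4) come from the suggestion above; the other flag names and bit numbers
are invented. It also assumes the LED control value is laid out as one 4-bit
function field per LED, with LED[0] in bits 3:0. That layout is inferred
from the two values in the patch (0x1177 and 0x1040) and the comment next to
them, and has not been checked against the datasheet.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Hypothetical flag bits: only LED0_1000M / BIT(4) appears in the thread,
 * the remaining names and bit numbers are made up for illustration. */
#define MARVELL_PHY_FLAG_LED0_1000M	BIT(4)
#define MARVELL_PHY_FLAG_LED1_100M	BIT(5)
#define MARVELL_PHY_FLAG_LED1_ACT	BIT(6)
#define MARVELL_PHY_FLAG_LED2_ACT	BIT(7)

/* Assumed layout: one 4-bit function field per LED, LED[0] in bits 3:0. */
#define LED_FIELD(led, val)	((uint16_t)((val) << (4 * (led))))

/* Top nibble shared by both values in the patch (0x1177 and 0x1040). */
#define LED_CFG_BASE		0x1000

/* Translate behaviour flags into an LED control register value. */
static uint16_t marvell_led_cfg_from_flags(uint32_t flags)
{
	uint16_t cfg = LED_CFG_BASE;

	/* LED[0]: field 0x7 as in 0x1177 (1000M link), else 0x0 as in 0x1040. */
	if (flags & MARVELL_PHY_FLAG_LED0_1000M)
		cfg |= LED_FIELD(0, 0x7);

	/* LED[1]: 0x7 as in 0x1177 (100M link) or 0x4 as in 0x1040 (activity).
	 * If both flags are set, the 100M setting wins in this sketch. */
	if (flags & MARVELL_PHY_FLAG_LED1_100M)
		cfg |= LED_FIELD(1, 0x7);
	else if (flags & MARVELL_PHY_FLAG_LED1_ACT)
		cfg |= LED_FIELD(1, 0x4);

	/* LED[2]: 0x1 as in 0x1177 (blink on activity), else 0x0. */
	if (flags & MARVELL_PHY_FLAG_LED2_ACT)
		cfg |= LED_FIELD(2, 0x1);

	return cfg;
}

/* Self-check: the two configurations from the patch can both be expressed. */
static void marvell_led_cfg_selftest(void)
{
	assert(marvell_led_cfg_from_flags(MARVELL_PHY_FLAG_LED0_1000M |
					  MARVELL_PHY_FLAG_LED1_100M |
					  MARVELL_PHY_FLAG_LED2_ACT) == 0x1177);
	assert(marvell_led_cfg_from_flags(MARVELL_PHY_FLAG_LED1_ACT) == 0x1040);
}
```

With something along these lines, a board would only state which behaviour
it wants per LED, and the PHY driver would not need a new
MARVELL_PHY_*_LEDS define for every platform. The polarity register written
by the patch is not covered here.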

> 
> Signed-off-by: Jian Shen <shenjian15@huawei.com>
> ---
>  drivers/net/phy/marvell.c   | 22 +++++++++++++++++++++-
>  include/linux/marvell_phy.h |  1 +
>  2 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
> index 3ccba37..c195286 100644
> --- a/drivers/net/phy/marvell.c
> +++ b/drivers/net/phy/marvell.c
> @@ -128,6 +128,10 @@
>  #define MII_PHY_LED_CTRL	        16
>  #define MII_88E1121_PHY_LED_DEF		0x0030
>  #define MII_88E1510_PHY_LED_DEF		0x1177
> +#define MII_88E1510_PHY_HNS3_LED_DEF	0x1040
> +
> +#define MII_88E1510_PHY_LED_POLARITY_CTRL	0x11
> +#define MII_88E1510_PHY_HNS3_LED_POLARITY	0x4415
>  
>  #define MII_M1011_PHY_STATUS		0x11
>  #define MII_M1011_PHY_STATUS_1000	0x8000
> @@ -619,12 +623,19 @@ static void marvell_config_led(struct phy_device *phydev)
>  		def_config = MII_88E1121_PHY_LED_DEF;
>  		break;
>  	/* Default PHY LED config:
> +	 * For hns3:
> +	 * LED[0] .. Link
> +	 * LED[1] .. Activity
> +	 * For others:
>  	 * LED[0] .. 1000Mbps Link
>  	 * LED[1] .. 100Mbps Link
>  	 * LED[2] .. Blink, Activity
>  	 */
>  	case MARVELL_PHY_FAMILY_ID(MARVELL_PHY_ID_88E1510):
> -		def_config = MII_88E1510_PHY_LED_DEF;
> +		if (phydev->dev_flags & MARVELL_PHY_M1510_HNS3_LEDS)
> +			def_config = MII_88E1510_PHY_HNS3_LED_DEF;
> +		else
> +			def_config = MII_88E1510_PHY_LED_DEF;
>  		break;
>  	default:
>  		return;
> @@ -634,6 +645,15 @@ static void marvell_config_led(struct phy_device *phydev)
>  			      def_config);
>  	if (err < 0)
>  		phydev_warn(phydev, "Fail to config marvell phy LED.\n");
> +
> +	if (phydev->dev_flags & MARVELL_PHY_M1510_HNS3_LEDS) {
> +		err = phy_write_paged(phydev, MII_MARVELL_LED_PAGE,
> +				      MII_88E1510_PHY_LED_POLARITY_CTRL,
> +				      MII_88E1510_PHY_HNS3_LED_POLARITY);
> +		if (err < 0)
> +			phydev_warn(phydev,
> +				    "Fail to config marvell phy LED polarity.\n");
> +	}
>  }
>  
>  static int marvell_config_init(struct phy_device *phydev)
> diff --git a/include/linux/marvell_phy.h b/include/linux/marvell_phy.h
> index 1eb6f24..99e0bbb 100644
> --- a/include/linux/marvell_phy.h
> +++ b/include/linux/marvell_phy.h
> @@ -32,5 +32,6 @@
>  /* struct phy_device dev_flags definitions */
>  #define MARVELL_PHY_M1145_FLAGS_RESISTANCE	0x00000001
>  #define MARVELL_PHY_M1118_DNS323_LEDS		0x00000002
> +#define MARVELL_PHY_M1510_HNS3_LEDS		0x00000004
>  
>  #endif /* _MARVELL_PHY_H */
> 

-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: hns3: add fixup handle for hns3 driver
From: Florian Fainelli @ 2019-02-14  4:08 UTC (permalink / raw)
  To: Jian Shen, andrew, hkallweit1, davem; +Cc: netdev, linux-kernel, linuxarm
In-Reply-To: <1550118667-119947-3-git-send-email-shenjian15@huawei.com>



On 2/13/2019 8:31 PM, Jian Shen wrote:
> The default LED configuration of the Marvell 88E1510 does not suit
> the hns3 driver; this patch fixes it.
> 
> Signed-off-by: Jian Shen <shenjian15@huawei.com>
> ---
>  .../net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c   | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
> index 84f2878..4c8346e 100644
> --- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
> +++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_mdio.c
> @@ -2,6 +2,7 @@
>  // Copyright (c) 2016-2017 Hisilicon Limited.
>  
>  #include <linux/etherdevice.h>
> +#include <linux/marvell_phy.h>
>  #include <linux/kernel.h>
>  
>  #include "hclge_cmd.h"
> @@ -125,6 +126,13 @@ static int hclge_mdio_read(struct mii_bus *bus, int phyid, int regnum)
>  	return le16_to_cpu(mdio_cmd->data_rd);
>  }
>  
> +static int hclge_phy_marvell_fixup(struct phy_device *phydev)
> +{
> +	phydev->dev_flags |= MARVELL_PHY_M1510_HNS3_LEDS;
> +
> +	return 0;
> +}
> +
>  int hclge_mac_mdio_config(struct hclge_dev *hdev)
>  {
>  	struct hclge_mac *mac = &hdev->hw.mac;
> @@ -168,6 +176,15 @@ int hclge_mac_mdio_config(struct hclge_dev *hdev)
>  	mac->phydev = phydev;
>  	mac->mdio_bus = mdio_bus;
>  
> +	/* register the PHY board fixup (for Marvell 88E1510) */
> +	ret = phy_register_fixup_for_uid(MARVELL_PHY_ID_88E1510,
> +					 MARVELL_PHY_ID_MASK,
> +					 hclge_phy_marvell_fixup);
> +	/* we can live without it, so just issue a warning */
> +	if (ret)
> +		dev_warn(&hdev->pdev->dev,
> +			 "Cannot register PHY board fixup\n");

You don't need to register a fixup to pass your flags; you can do
that at the time you attach to the PHY:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/phy.h#n945
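
As a rough illustration of that point, here is a userspace mock, not kernel
code. struct phy_device is reduced to the one field that matters, and
mock_phy_attach() and mock_hclge_connect_phy() are stand-ins invented for
this sketch. The assumption is that the real attach call behind the link
above takes a u32 flags argument that ends up in phydev->dev_flags; please
check the header before relying on that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value taken from patch 1/2 of this series. */
#define MARVELL_PHY_M1510_HNS3_LEDS	0x00000004

/* Userspace stand-in for the kernel structure; only the field used here. */
struct phy_device {
	uint32_t dev_flags;
};

/* Mock attach call: assumed to mirror an attach function that accepts
 * flags from the MAC driver and hands them on to the PHY driver. */
static int mock_phy_attach(struct phy_device *phydev, uint32_t flags)
{
	if (!phydev)
		return -1;

	phydev->dev_flags |= flags;
	return 0;
}

/* What the MAC driver's connect path could do instead of registering
 * and later unregistering a fixup. */
static int mock_hclge_connect_phy(struct phy_device *phydev)
{
	return mock_phy_attach(phydev, MARVELL_PHY_M1510_HNS3_LEDS);
}

/* Self-check: after connecting, the PHY driver would see the flag. */
static void mock_attach_selftest(void)
{
	struct phy_device phydev = { 0 };

	assert(mock_hclge_connect_phy(&phydev) == 0);
	assert(phydev.dev_flags & MARVELL_PHY_M1510_HNS3_LEDS);
}
```

Passing the flag this way keeps it scoped to the PHY that this MAC attaches
to, whereas a fixup registered by UID would apply to every matching 88E1510
in the system.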


> +
>  	return 0;
>  }
>  
> @@ -240,6 +257,8 @@ void hclge_mac_disconnect_phy(struct hnae3_handle *handle)
>  	if (!phydev)
>  		return;
>  
> +	phy_unregister_fixup_for_uid(MARVELL_PHY_ID_88E1510,
> +				     MARVELL_PHY_ID_MASK);
>  	phy_disconnect(phydev);
>  }
>  
> 

-- 
Florian

^ permalink raw reply

* Re: [PATCH net] net: phy: fix interrupt handling in non-started states
From: Florian Fainelli @ 2019-02-14  4:10 UTC (permalink / raw)
  To: Heiner Kallweit, Andrew Lunn, David Miller
  Cc: netdev@vger.kernel.org, Russell King - ARM Linux
In-Reply-To: <25e86edc-0b88-8c03-b692-776e971331f2@gmail.com>



On 2/12/2019 10:56 AM, Heiner Kallweit wrote:
> phylib enables interrupts before phy_start() has been called, and if
> we receive an interrupt in a non-started state, the interrupt handler
> returns IRQ_NONE. This causes problems with at least one Marvell chip
> as reported by Andrew.
> Fix this by handling interrupts the same as in phy_mac_interrupt(),
> basically always running the phylib state machine. It knows when it
> has to do something and when not.
> This change allows to handle interrupts gracefully even if they
> occur in a non-started state.
> 
> Fixes: 2b3e88ea6528 ("net: phy: improve phy state checking")
> Reported-by: Andrew Lunn <andrew@lunn.ch>
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply
