Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net] bpf: enforce correct alignment for instructions
From: David Miller @ 2018-06-21  3:46 UTC (permalink / raw)
  To: edumazet; +Cc: netdev, eric.dumazet, daniel, kafai, ast
In-Reply-To: <20180621002409.63136-1-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Wed, 20 Jun 2018 17:24:09 -0700

> After commit 9facc336876f ("bpf: reject any prog that failed read-only lock")
> offsetof(struct bpf_binary_header, image) became 3 instead of 4,
> breaking powerpc BPF badly, since instructions need to be word aligned.
> 
> Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

I'll apply this directly, thanks Eric.

^ permalink raw reply

* Re: [PATCH net] sctp: fix erroneous inc of snmp SctpFragUsrMsgs
From: David Miller @ 2018-06-21  3:49 UTC (permalink / raw)
  To: marcelo.leitner; +Cc: netdev, linux-sctp, nhorman, vyasevich
In-Reply-To: <d89c1e422158d21710ce938aa093a20960bd55e9.1529509634.git.marcelo.leitner@gmail.com>

From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Wed, 20 Jun 2018 12:47:52 -0300

> Currently it is incrementing SctpFragUsrMsgs when the user message size
> is of the exactly same size as the maximum fragment size, which is wrong.
> 
> The fix is to increment it only when user message is bigger than the
> maximum fragment size.
> 
> Fixes: bfd2e4b8734d ("sctp: refactor sctp_datamsg_from_user")
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net] bpf: enforce correct alignment for instructions
From: Eric Dumazet @ 2018-06-21  4:08 UTC (permalink / raw)
  To: David Miller, edumazet; +Cc: netdev, eric.dumazet, daniel, kafai, ast
In-Reply-To: <20180621.124639.2276956745930721354.davem@davemloft.net>



On 06/20/2018 08:46 PM, David Miller wrote:
> From: Eric Dumazet <edumazet@google.com>
> Date: Wed, 20 Jun 2018 17:24:09 -0700
> 
>> After commit 9facc336876f ("bpf: reject any prog that failed read-only lock")
>> offsetof(struct bpf_binary_header, image) became 3 instead of 4,
>> breaking powerpc BPF badly, since instructions need to be word aligned.
>>
>> Fixes: 9facc336876f ("bpf: reject any prog that failed read-only lock")
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
> 
> I'll apply this directly, thanks Eric.
> 

Thanks David :)

^ permalink raw reply

* [PATCH net] ipvlan: fix IFLA_MTU ignored on NEWLINK
From: Xin Long @ 2018-06-21  4:56 UTC (permalink / raw)
  To: network dev; +Cc: davem, Jarod Wilson, Mahesh Bandewar

Commit 296d48568042 ("ipvlan: inherit MTU from master device") adjusted
the mtu from the master device when creating a ipvlan device, but it
would also override the mtu value set in rtnl_create_link. It causes
IFLA_MTU param not to take effect.

So this patch is to not adjust the mtu if IFLA_MTU param is set when
creating a ipvlan device.

Fixes: 296d48568042 ("ipvlan: inherit MTU from master device")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 drivers/net/ipvlan/ipvlan_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index d02f0a7..23c1d660 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -594,7 +594,8 @@ int ipvlan_link_new(struct net *src_net, struct net_device *dev,
 	ipvlan->phy_dev = phy_dev;
 	ipvlan->dev = dev;
 	ipvlan->sfeatures = IPVLAN_FEATURES;
-	ipvlan_adjust_mtu(ipvlan, phy_dev);
+	if (!tb[IFLA_MTU])
+		ipvlan_adjust_mtu(ipvlan, phy_dev);
 	INIT_LIST_HEAD(&ipvlan->addrs);
 	spin_lock_init(&ipvlan->addrs_lock);
 
-- 
2.1.0

^ permalink raw reply related

* [PATCH net] vhost_net: validate sock before trying to put its fd
From: Jason Wang @ 2018-06-21  5:11 UTC (permalink / raw)
  To: mst, kvm, virtualization, netdev, linux-kernel; +Cc: dan.carpenter

Sock will be NULL if we pass -1 to vhost_net_set_backend(), but when
we meet errors during ubuf allocation, the code does not check for
NULL before calling sockfd_put(), this will lead NULL
dereferencing. Fixing by checking sock pointer before.

Fixes: bab632d69ee4 ("vhost: vhost TX zero-copy support")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 986058a..b97a994 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1208,7 +1208,8 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
 	if (ubufs)
 		vhost_net_ubuf_put_wait_and_free(ubufs);
 err_ubufs:
-	sockfd_put(sock);
+	if (sock)
+		sockfd_put(sock);
 err_vq:
 	mutex_unlock(&vq->mutex);
 err:
-- 
2.7.4

^ permalink raw reply related

* Re: Route fallback issue
From: Grant Taylor @ 2018-06-21  5:13 UTC (permalink / raw)
  To: Julian Anastasov, Akshat Kakkar
  Cc: netdev, cronolog+lartc, lartc, Erik Auerswald
In-Reply-To: <alpine.LFD.2.20.1806202139330.1969@ja.home.ssi.bg>

On 06/20/2018 01:00 PM, Julian Anastasov wrote:
> You can also try alternative routes.

"Alternative routes"?  I can't say as I've heard that description as a 
specific technique / feature / capability before.

Is that it's official name?

Where can I find out more about it?

> But as the kernel supports only default alternative routes, you can put 
> them in their own table:

I don't know that that is the case any more.

I was able to issue the following commands without a problem:

# ip route append 192.0.2.128/26 via 192.0.2.62
# ip route append 192.0.2.128/26 via 192.0.2.126

I crated two network namespaces and had a pair of vEths between them 
(192.0.2.0/26 and 192.0.2.64/26).  I added a dummy network to each NetNS 
(192.0.2.128/26 and 192.0.2.192/26).

I ran the following commands while a persistent ping was running from 
one NetNS to the IP on the other's dummy0 interface:

# ip link set ns2b up && ip route append 192.0.2.192/26 via 192.0.2.126 
&& ip link set ns2a down
(pause and watch things)
# ip link set ns2a up && ip route append 192.0.2.192/26 via 192.0.2.62 
&& ip link set ns2b down
(pause and watch things)

I could iterate between the two above commands and pings continued to work.

So, I think that it's now possible to use "alternate routes" (new to me) 
on specific prefixes in addition to the default.  Thus there is no 
longer any need for a separate table and the associated IP rule.

I'm running kernel version 4.9.76.

I did go ahead and set net.ipv4.conf.ns2b.ignore_routes_with_linkdown to 1.

for i in /proc/sys/net/ipv4/conf/*/ignore_routes_with_linkdown; do echo 
1 > $i; done

Doing that dropped the number of dropped pings from 60 ~ 90 (1 / second) 
to 0 ~ 5 (1 / second).  (Rarely, maybe 1 out of 20 flips, would it take 
upwards of 10 pings / seconds.)

> # Alternative routes use same metric!!!
> ip route append default via 192.168.1.254 dev eno1 table 100
> ip route append default via 192.168.2.254 dev eno2 table 100
> ip rule add prio 100 to 172.16.0.0/12 table 100

I did have to "append" the route.  I couldn't just "add" the route. 
When I tried to "add" the second route, I got an error about the route 
already existing.  Using "append" instead of "add" with everything else 
the same worked just fine.

Note:  I did go ahead and remove the single route that was added via 
"add" and used "append" for both.

> Of course, you will get better results if an user space tool puts only 
> alive routes in service after doing health checks of all near gateways.

I've got to say, with as well as this is working, I don't feel any need 
for a user space monitoring daemon.  I agree that I've felt the need for 
such in the past before I learned about "alternative routes".

I still want to learn more about "alternative routes".

Here's a diagram of the test network if someone wants to try to 
reproduce my findings:

+-------------+                +-------------+
| NS1         |                |         NS2 |
|        ns2a +-----vEth-A-----+ ns1a        |
|             |                |             |
+ dummy0      |                |      dummy0 +
|             |                |             |
|        ns2b +-----vEth-B-----+ ns1b        |
|             |                |             |
+-------------+                +-------------+

(vEths get the name of the NS that they face.)

NS1:ns2a     192.0.2.1     /26
NS1:ns2b     192.0.2.65    /26
NS1:dummy0   192.0.2.129   /26
NS2:ns1a     192.0.2.62    /26
NS2:ns1b     192.0.2.126   /26
NS2:dummy0   192.0.2.254   /26



-- 
Grant. . . .
unix || die

^ permalink raw reply

* Re: [PATCH bpf 1/2] tools: bpftool: remove duplicated error message on prog load
From: Song Liu @ 2018-06-21  5:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, Networking, oss-drivers
In-Reply-To: <20180620184246.18672-2-jakub.kicinski@netronome.com>

On Wed, Jun 20, 2018 at 11:42 AM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> do_pin_fd() will already print out an error message if something
> goes wrong.  Printing another error is unnecessary and will break
> JSON output, since error messages are full objects:
>
> $ bpftool -jp prog load tracex1_kern.o /sys/fs/bpf/a
> {
>     "error": "can't pin the object (/sys/fs/bpf/a): File exists"
> },{
>     "error": "failed to pin program"
> }
>
> Fixes: 49a086c201a9 ("bpftool: implement prog load command")
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  tools/bpf/bpftool/prog.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index 05f42a46d6ed..12b694fe0404 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -694,10 +694,8 @@ static int do_load(int argc, char **argv)
>                 return -1;
>         }
>
> -       if (do_pin_fd(prog_fd, argv[1])) {
> -               p_err("failed to pin program");
> +       if (do_pin_fd(prog_fd, argv[1]))
>                 return -1;
> -       }
>
>         if (json_output)
>                 jsonw_null(json_wtr);
> --
> 2.17.1
>

^ permalink raw reply

* Re: [PATCH bpf 2/2] tools: bpftool: remember to close the libbpf object after prog load
From: Song Liu @ 2018-06-21  5:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Daniel Borkmann, Networking, oss-drivers
In-Reply-To: <20180620184246.18672-3-jakub.kicinski@netronome.com>

On Wed, Jun 20, 2018 at 11:42 AM, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> Remembering to close all descriptors and free memory may not seem
> important in a user space tool like bpftool, but if we were to run
> in batch mode the consumed resources start to add up quickly.  Make
> sure program load closes the libbpf object (which unloads and frees
> it).
>
> Fixes: 49a086c201a9 ("bpftool: implement prog load command")
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>

Acked-by: Song Liu <songliubraving@fb.com>

> ---
>  tools/bpf/bpftool/prog.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
> index 12b694fe0404..959aa53ab678 100644
> --- a/tools/bpf/bpftool/prog.c
> +++ b/tools/bpf/bpftool/prog.c
> @@ -695,12 +695,18 @@ static int do_load(int argc, char **argv)
>         }
>
>         if (do_pin_fd(prog_fd, argv[1]))
> -               return -1;
> +               goto err_close_obj;
>
>         if (json_output)
>                 jsonw_null(json_wtr);
>
> +       bpf_object__close(obj);
> +
>         return 0;
> +
> +err_close_obj:
> +       bpf_object__close(obj);
> +       return -1;
>  }
>
>  static int do_help(int argc, char **argv)
> --
> 2.17.1
>

^ permalink raw reply

* Re: [Y2038] [PATCH] ceph: use ktime_get_real_seconds()
From: Allen @ 2018-06-21  5:41 UTC (permalink / raw)
  To: arnd; +Cc: netdev, y2038, ceph-devel, David Miller, linux-kernel
In-Reply-To: <CAK8P3a17ui5QuvPzgWmQby0zVcxSKpU_0hbB7_Gdt6i39oo5hQ@mail.gmail.com>

> > Signed-off-by: Allen Pais <allen.lkml@gmail.com>
>
> I have done a similar patch and will post it soon along with the rest of the
> ceph y2038 series. Please have a look at "ceph: use timespec64 in for
> keepalive" and comment if you see something that I missed.
>

Oh Okay. Thank you Arnd.

^ permalink raw reply

* Re: [PATCH net] ipvlan: fix IFLA_MTU ignored on NEWLINK
From: David Miller @ 2018-06-21  5:58 UTC (permalink / raw)
  To: lucien.xin; +Cc: netdev, jarod, maheshb
In-Reply-To: <c6b5afd356eb6727f7f1f8b5cfc872c187e1c813.1529556964.git.lucien.xin@gmail.com>

From: Xin Long <lucien.xin@gmail.com>
Date: Thu, 21 Jun 2018 12:56:04 +0800

> Commit 296d48568042 ("ipvlan: inherit MTU from master device") adjusted
> the mtu from the master device when creating a ipvlan device, but it
> would also override the mtu value set in rtnl_create_link. It causes
> IFLA_MTU param not to take effect.
> 
> So this patch is to not adjust the mtu if IFLA_MTU param is set when
> creating a ipvlan device.
> 
> Fixes: 296d48568042 ("ipvlan: inherit MTU from master device")
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH] net: vhost: improve performance when enable busyloop
From: Jason Wang @ 2018-06-21  5:59 UTC (permalink / raw)
  To: Tonghao Zhang; +Cc: netdev, Tonghao Zhang, virtualization
In-Reply-To: <1529501332-118823-1-git-send-email-xiangxia.m.yue@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2916 bytes --]



On 2018年06月20日 21:28, Tonghao Zhang wrote:
> This patch improves the guest receive performance from
> host. On the handle_tx side, we poll the sock receive
> queue at the same time. handle_rx do that in the same way.
>
> we set the poll-us=100 us and use the iperf3 to test
> its throughput. The iperf3 command is shown as below.
>
> iperf3 -s -D
> iperf3 -c 192.168.1.100 -i 1 -P 10 -t 10 -M 1400 --bandwidth 100000M
>
> * With the patch:    21.1 Gbits/sec
> * Without the patch: 12.7 Gbits/sec

Thanks a lot for the patch. But looks like it needs some work to avoid 
e.g deadlock.

E.g in vhost_process_iotlb_msg() we call vhost_dev_lock_vqs() which did:

     for (i = 0; i < d->nvqs; ++i)
         mutex_lock_nested(&d->vqs[i]->mutex, i);

I believe we need to change the code to lock the vq one by one like the 
attached (only compile test).

> Signed-off-by: Tonghao Zhang <zhangtonghao@didichuxing.com>
> ---
>   drivers/vhost/net.c | 21 +++++++++++++++++++++
>   1 file changed, 21 insertions(+)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index e7cf7d2..9364ede 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -429,22 +429,43 @@ static int vhost_net_enable_vq(struct vhost_net *n,
>   	return vhost_poll_start(poll, sock->file);
>   }
>   
> +static int sk_has_rx_data(struct sock *sk);
> +

How about move sk_has_rx_data() here.

>   static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
>   				    struct vhost_virtqueue *vq,
>   				    struct iovec iov[], unsigned int iov_size,
>   				    unsigned int *out_num, unsigned int *in_num)
>   {
>   	unsigned long uninitialized_var(endtime);
> +	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_RX];
> +	struct vhost_virtqueue *rvq = &nvq->vq;
> +	struct socket *sock = rvq->private_data;
> +
>   	int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>   				  out_num, in_num, NULL, NULL);
>   
>   	if (r == vq->num && vq->busyloop_timeout) {
> +		mutex_lock_nested(&rvq->mutex, 1);
> +
> +		vhost_disable_notify(&net->dev, rvq);
> +
>   		preempt_disable();
>   		endtime = busy_clock() + vq->busyloop_timeout;
>   		while (vhost_can_busy_poll(vq->dev, endtime) &&
> +		       !(sock && sk_has_rx_data(sock->sk)) &&
>   		       vhost_vq_avail_empty(vq->dev, vq))
>   			cpu_relax();
>   		preempt_enable();
> +
> +		if (sock && sk_has_rx_data(sock->sk))
> +			vhost_poll_queue(&rvq->poll);
> +		else if (unlikely(vhost_enable_notify(&net->dev, rvq))) {
> +			vhost_disable_notify(&net->dev, rvq);
> +			vhost_poll_queue(&rvq->poll);
> +		}
> +
> +		mutex_unlock(&rvq->mutex);

Some kinds of code duplication, can we try to unify them?

Btw, net-next is closed, so you need resubmit after it was open and use 
a "net-next" as the prefix of the patch.

Thanks

> +
>   		r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>   				      out_num, in_num, NULL, NULL);
>   	}


[-- Attachment #2: 0001-vhost-lock-vqs-one-by-one.patch --]
[-- Type: text/x-patch, Size: 2179 bytes --]

>From 383fe9d98420d92a632dc554969b4b1716017ba2 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang@redhat.com>
Date: Thu, 21 Jun 2018 13:58:31 +0800
Subject: [PATCH] vhost: lock vqs one by one

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index e5bc4bb..937252d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -294,8 +294,11 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 {
 	int i;
 
-	for (i = 0; i < d->nvqs; ++i)
+	for (i = 0; i < d->nvqs; ++i) {
+		mutex_lock(&d->vqs[i]->mutex);
 		__vhost_vq_meta_reset(d->vqs[i]);
+		mutex_unlock(&d->vqs[i]->mutex);
+	}
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -855,20 +858,6 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 #define vhost_get_used(vq, x, ptr) \
 	vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
 
-static void vhost_dev_lock_vqs(struct vhost_dev *d)
-{
-	int i = 0;
-	for (i = 0; i < d->nvqs; ++i)
-		mutex_lock_nested(&d->vqs[i]->mutex, i);
-}
-
-static void vhost_dev_unlock_vqs(struct vhost_dev *d)
-{
-	int i = 0;
-	for (i = 0; i < d->nvqs; ++i)
-		mutex_unlock(&d->vqs[i]->mutex);
-}
-
 static int vhost_new_umem_range(struct vhost_umem *umem,
 				u64 start, u64 size, u64 end,
 				u64 userspace_addr, int perm)
@@ -918,7 +907,9 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
 		if (msg->iova <= vq_msg->iova &&
 		    msg->iova + msg->size - 1 > vq_msg->iova &&
 		    vq_msg->type == VHOST_IOTLB_MISS) {
+			mutex_lock(&node->vq->mutex);
 			vhost_poll_queue(&node->vq->poll);
+			mutex_unlock(&node->vq->mutex);
 			list_del(&node->node);
 			kfree(node);
 		}
@@ -950,7 +941,6 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 	int ret = 0;
 
 	mutex_lock(&dev->mutex);
-	vhost_dev_lock_vqs(dev);
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
 		if (!dev->iotlb) {
@@ -984,7 +974,6 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 		break;
 	}
 
-	vhost_dev_unlock_vqs(dev);
 	mutex_unlock(&dev->mutex);
 
 	return ret;
-- 
2.7.4


[-- Attachment #3: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply related

* Re: [PATCH] net: macb: Fix ptp time adjustment for large negative delta
From: David Miller @ 2018-06-21  6:01 UTC (permalink / raw)
  To: harini.katakam
  Cc: nicolas.ferre, netdev, linux-kernel, michal.simek,
	harinikatakamlinux
In-Reply-To: <1529494460-4689-1-git-send-email-harini.katakam@xilinx.com>

From: Harini Katakam <harini.katakam@xilinx.com>
Date: Wed, 20 Jun 2018 17:04:20 +0530

> When delta passed to gem_ptp_adjtime is negative, the sign is
> maintained in the ns_to_timespec64 conversion. Hence timespec_add
> should be used directly. timespec_sub will just subtract the negative
> value thus increasing the time difference.
> 
> Signed-off-by: Harini Katakam <harini.katakam@xilinx.com>

Applied and queued up for -stable.

^ permalink raw reply

* Re: [PATCH] selftests: net: add tcp_inq to gitignore
From: David Miller @ 2018-06-21  6:02 UTC (permalink / raw)
  To: anders.roxell; +Cc: shuah, netdev, linux-kselftest, linux-kernel
In-Reply-To: <20180620224344.6425-1-anders.roxell@linaro.org>

From: Anders Roxell <anders.roxell@linaro.org>
Date: Thu, 21 Jun 2018 00:43:44 +0200

> sha: 702353b538f5 ("selftest: add test for TCP_INQ") forgot to add
> tcp_inq to .gitignore.
> 
> Signed-off-by: Anders Roxell <anders.roxell@linaro.org>

Applied, thank you.

^ permalink raw reply

* [PATCH net] xfrm: fix missing dst_release() after policy blocking lbcast and multicast
From: Tommi Rantala @ 2018-06-21  6:30 UTC (permalink / raw)
  To: netdev, Steffen Klassert
  Cc: huaibin Wang, Tommi Rantala, Herbert Xu, David S. Miller,
	open list

Fix missing dst_release() when local broadcast or multicast traffic is
xfrm policy blocked.

For IPv4 this results to dst leak: ip_route_output_flow() allocates
dst_entry via __ip_route_output_key() and passes it to
xfrm_lookup_route(). xfrm_lookup returns ERR_PTR(-EPERM) that is
propagated. The dst that was allocated is never released.

IPv4 local broadcast testcase:
 ping -b 192.168.1.255 &
 sleep 1
 ip xfrm policy add src 0.0.0.0/0 dst 192.168.1.255/32 dir out action block

IPv4 multicast testcase:
 ping 224.0.0.1 &
 sleep 1
 ip xfrm policy add src 0.0.0.0/0 dst 224.0.0.1/32 dir out action block

For IPv6 the missing dst_release() causes trouble e.g. when used in netns:
 ip netns add TEST
 ip netns exec TEST ip link set lo up
 ip link add dummy0 type dummy
 ip link set dev dummy0 netns TEST
 ip netns exec TEST ip addr add fd00::1111 dev dummy0
 ip netns exec TEST ip link set dummy0 up
 ip netns exec TEST ping -6 -c 5 ff02::1%dummy0 &
 sleep 1
 ip netns exec TEST ip xfrm policy add src ::/0 dst ff02::1 dir out action block
 wait
 ip netns del TEST

After netns deletion we see:
[  258.239097] unregister_netdevice: waiting for lo to become free. Usage count = 2
[  268.279061] unregister_netdevice: waiting for lo to become free. Usage count = 2
[  278.367018] unregister_netdevice: waiting for lo to become free. Usage count = 2
[  288.375259] unregister_netdevice: waiting for lo to become free. Usage count = 2

Fixes: ac37e2515c1a ("xfrm: release dst_orig in case of error in xfrm_lookup()")
Signed-off-by: Tommi Rantala <tommi.t.rantala@nokia.com>
---
 net/xfrm/xfrm_policy.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 5f48251c1319..7c5e8978aeaa 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2286,6 +2286,9 @@ struct dst_entry *xfrm_lookup_route(struct net *net, struct dst_entry *dst_orig,
 	if (IS_ERR(dst) && PTR_ERR(dst) == -EREMOTE)
 		return make_blackhole(net, dst_orig->ops->family, dst_orig);
 
+	if (IS_ERR(dst))
+		dst_release(dst_orig);
+
 	return dst;
 }
 EXPORT_SYMBOL(xfrm_lookup_route);
-- 
2.16.2

^ permalink raw reply related

* Re: [PATCH net] cls_flower: fix use after free in flower S/W path
From: Paolo Abeni @ 2018-06-21  7:21 UTC (permalink / raw)
  To: Cong Wang
  Cc: Linux Kernel Network Developers, Jamal Hadi Salim, Jiri Pirko,
	Marcelo Ricardo Leitner, Paul Blakey
In-Reply-To: <CAM_iQpVMrTHYHE+DXDGyqXzsOnM624yaX4ddyd=C9zpB9wAjzg@mail.gmail.com>

Hi,

On Wed, 2018-06-20 at 11:06 -0700, Cong Wang wrote:
> On Wed, Jun 20, 2018 at 10:34 AM, Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > +static void fl_mask_free(struct fl_flow_mask *mask)
> > +{
> > +       rhashtable_destroy(&mask->ht);
> 
> I don't believe you can call rhashtable_destroy() in BH
> context, it acquires a mutex...

Thank you for the review. You are right. Tomorrow (I'm on holiday
today) I'll look towards leverging tcf_queue_work() to schedule
fl_mask_free()/rhashtable_destroy() at a proper time. 

Cheers,

Paolo

^ permalink raw reply

* Re: [PATCH] optoe: driver to read/write SFP/QSFP EEPROMs
From: Andrew Lunn @ 2018-06-21  8:11 UTC (permalink / raw)
  To: Don Bollinger; +Cc: netdev, Florian Fainelli
In-Reply-To: <20180621052425.pa464laxjrebm5s3@thebollingers.org>

> I'm trying to figure out how the netdev environment works on large
> switches.  I can't imagine that the kernel network stack is involved in
> any part of the data plane. 

Hi Don

It is involved in the slow path. I.e. packets from the host out
network ports. BPDU, IGMP, ARP, ND, etc. It can also be involved when
the hardware is missing features. Also, for communication with the
host itself.

What is more important is it is the control plane. Want to bridge two
ports together?  You create a software bridge, add the two ports, and
then offload it to the hardware. The kernel STP code in the software
bridge then does the state tracking, blocked, learning, forwarding
etc. Need RSTP? Just run the standard RSTP daemon on the software
bridge interface.

What to add a route between two ports? Just use the ip route add. It
then it gets offloaded to the hardware. This means routing daemons
like FreeRangeRouting just work. 

Want to combine two ports to make a trunk? Use the bond/team driver,
and offload it to the hardware.

Basically, use the Linux network stack as is, and offload what you can
to the hardware. That means you keep all your existing user space
network tools, daemons, SNMP agents, etc. They all just work, because
the kernel APIs remain the same, independent of if you have a switch,
or just a bunch of networks cards.

> Can you point me to any conference slides,
> or design docs or other documentation that describes a netdev
> implementation on Trident II switch silicon?  Or any other switch that
> supports 128 x 10G (or 32 x 40G) ports?

Look at past netdev conference. In particular, talks given by
Mellanox, Cumulus, and Netronome. You can also see there drivers in
drivers/ethernet/{mellonex|netronome}. These devices however tend to
go for firmware to control the PHYs, not the Linux network stack.
drivers/net/dsa covers SOHO switches, so up to 10 ports, mostly 1G,
some 10G ports. There is a lot of industry involved thin this segment,
trains, planes, busses, plant automation, etc, and some WiFi and STP.
Switches with DSA drivers make use of Linux to control the PHYs, not
firmware.

SFP are also slowly starting to enter the consumer market, with
products like the Clearfog, MACCHIATObin, and there are a few
industrial boards with SOHO switches with SFP sockets or SFF
modules. These are what are driving in kernel SFP support.

> Also, I see the sfp.c code.  Is there any QSFP code?  I'd like to see
> how that code compares to the sfp.c code.

Currently, none that i know of. SFP+ is the limit so far. Mainly
because SoC/SOHO switches currently top out at 10G, so SFP+ is
sufficient.

> optoe can provide access, through the SFP code, to the rest of the EEPROM
> address space.  It can also provide access to the QSFP EEPROM.  I would
> like to collaborate on the integration, that would fit my objective of
> making more EEPROM space accessible on more platforms and distros.
> 
> However, you don't want me to make the changes to SFP myself.  I don't
> have any hardware or OS environment that currently supports that code.
> The cost and learning curve exceed my resources.  I *think* the changes
> to the SFP code would be small, but I would need someone who understands
> the code and can test it to actually make and submit the changes.

So i have been thinking about this some more. But given your lack of
resources, i'm guessing this won't actually work for you. But it is
probably the correct way forwards.

The basic problem the systems you are talking about is that they don't
have a network interface per port. So they cannot use ethtool
--module-info, which IS the defined API to get access to SFP
data. Adding another API is probably not going to get accepted.

However, the current ethtool API is ioctl based. The network stack is
moving away from that, replacing it with netlink sockets. All IP
configuration is now via netlink. All bridge configuration is by
netlink, etc. So there is a desire to move ethtool to netlink.

This move makes the API more flexible. By default, you would expect
the replacement implementation for --module-info to pass an ifindex,
or maybe an interface name. However, it could be made to optionally
take an i2c bus number. That could then go direct to the SFP code,
rather than go via the MAC driver. That would give evil user space,
proprietary, binary blob drivers access to SFP via the standard
supported kernel API, using the standard supported kernel SFP driver.

But that requires you roll up your sleeves and get stuck in to this
conversion work.

But you say you work for a fibre module company. Do they produce
SFP/SFP+ modules? You could get one of the supported consumer boards
with an SFP/SFP+ socket and test your modules work properly. Build out
the SFP code. It has been on my TODO list to add HWMON support for the
temperature sensors, etc.

	   Andrew

^ permalink raw reply

* [PATCH net-next v4 1/2] r8169: Don't disable ASPM in the driver
From: Kai-Heng Feng @ 2018-06-21  8:30 UTC (permalink / raw)
  To: davem
  Cc: ryankao, hayeswang, hau, hkallweit1, romieu, bhelgaas, acelan.kao,
	netdev, linux-pci, linux-kernel, Kai-Heng Feng

Enable or disable ASPM should be done in PCI core instead of in the
device driver.

Commit ba04c7c93bbc ("r8169: disable ASPM") uses
pci_disable_link_state() to disable ASPM, but it's not the best way to
do it. If the device really wants to disable ASPM, we can use a quirk in
PCI core to prevent the PCI core from setting ASPM before probe.

Let's remove pci_disable_link_state() for now. Use PCI core quirks if
any regression happens.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
v4:
- Improve commit message.
- Move the empty line to where it belongs.

v3:
- Change commit message wording.
- Rename the function to rtl_hw_aspm_clkreq_enable().

v2:
- Remove module parameter.
- Remove pci_disable_link_state().

 drivers/net/ethernet/realtek/r8169.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 75dfac0248f4..2bdfd25acd58 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -25,7 +25,6 @@
 #include <linux/dma-mapping.h>
 #include <linux/pm_runtime.h>
 #include <linux/firmware.h>
-#include <linux/pci-aspm.h>
 #include <linux/prefetch.h>
 #include <linux/ipv6.h>
 #include <net/ip6_checksum.h>
@@ -7647,11 +7646,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	mii->reg_num_mask = 0x1f;
 	mii->supports_gmii = cfg->has_gmii;
 
-	/* disable ASPM completely as that cause random device stop working
-	 * problems as well as full system hangs for some PCIe devices users */
-	pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1 |
-				     PCIE_LINK_STATE_CLKPM);
-
 	/* enable device (incl. PCI PM wakeup and hotplug setup) */
 	rc = pcim_enable_device(pdev);
 	if (rc < 0) {
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next v4 2/2] r8169: Reinstate ASPM Support
From: Kai-Heng Feng @ 2018-06-21  8:30 UTC (permalink / raw)
  To: davem
  Cc: ryankao, hayeswang, hau, hkallweit1, romieu, bhelgaas, acelan.kao,
	netdev, linux-pci, linux-kernel, Kai-Heng Feng
In-Reply-To: <20180621083039.22545-1-kai.heng.feng@canonical.com>

On Intel platforms (Skylake and newer), ASPM support in r8169 is the
last missing puzzle to let CPU's Package C-State reaches PC8.  Without
ASPM support, the CPU cannot reach beyond PC3. PC8 can save additional
~3W in comparison with PC3 on a Coffee Lake platform, Dell G3 3779.

This is based on the work from Chunhao Lin <hau@realtek.com>.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
v4:
- Improve commit message.
- Move the empty line to where it belongs.

v3:
- Change commit message wording.
- Rename the function to rtl_hw_aspm_clkreq_enable().

v2:
- Remove module parameter.
- Remove pci_disable_link_state().

 drivers/net/ethernet/realtek/r8169.c | 39 +++++++++++++++++++---------
 1 file changed, 27 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 2bdfd25acd58..269ac7561368 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5289,6 +5289,17 @@ static void rtl_pcie_state_l2l3_enable(struct rtl8169_private *tp, bool enable)
 	RTL_W8(tp, Config3, data);
 }
 
+static void rtl_hw_aspm_clkreq_enable(struct rtl8169_private *tp, bool enable)
+{
+	if (enable) {
+		RTL_W8(tp, Config2, RTL_R8(tp, Config2) | ClkReqEn);
+		RTL_W8(tp, Config5, RTL_R8(tp, Config5) | ASPM_en);
+	} else {
+		RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
+		RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	}
+}
+
 static void rtl_hw_start_8168bb(struct rtl8169_private *tp)
 {
 	RTL_W8(tp, Config3, RTL_R8(tp, Config3) & ~Beacon_en);
@@ -5645,9 +5656,9 @@ static void rtl_hw_start_8168g_1(struct rtl8169_private *tp)
 	rtl_hw_start_8168g(tp);
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8168g_1, ARRAY_SIZE(e_info_8168g_1));
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168g_2(struct rtl8169_private *tp)
@@ -5680,9 +5691,9 @@ static void rtl_hw_start_8411_2(struct rtl8169_private *tp)
 	rtl_hw_start_8168g(tp);
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8411_2, ARRAY_SIZE(e_info_8411_2));
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
@@ -5699,8 +5710,7 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
 	};
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8168h_1, ARRAY_SIZE(e_info_8168h_1));
 
 	RTL_W32(tp, TxConfig, RTL_R32(tp, TxConfig) | TXCFG_AUTO_FIFO);
@@ -5779,6 +5789,8 @@ static void rtl_hw_start_8168h_1(struct rtl8169_private *tp)
 	r8168_mac_ocp_write(tp, 0xe63e, 0x0000);
 	r8168_mac_ocp_write(tp, 0xc094, 0x0000);
 	r8168_mac_ocp_write(tp, 0xc09e, 0x0000);
+
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168ep(struct rtl8169_private *tp)
@@ -5830,11 +5842,12 @@ static void rtl_hw_start_8168ep_1(struct rtl8169_private *tp)
 	};
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8168ep_1, ARRAY_SIZE(e_info_8168ep_1));
 
 	rtl_hw_start_8168ep(tp);
+
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168ep_2(struct rtl8169_private *tp)
@@ -5846,14 +5859,15 @@ static void rtl_hw_start_8168ep_2(struct rtl8169_private *tp)
 	};
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8168ep_2, ARRAY_SIZE(e_info_8168ep_2));
 
 	rtl_hw_start_8168ep(tp);
 
 	RTL_W8(tp, DLLPR, RTL_R8(tp, DLLPR) & ~PFM_EN);
 	RTL_W8(tp, MISC_1, RTL_R8(tp, MISC_1) & ~PFM_D3COLD_EN);
+
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168ep_3(struct rtl8169_private *tp)
@@ -5867,8 +5881,7 @@ static void rtl_hw_start_8168ep_3(struct rtl8169_private *tp)
 	};
 
 	/* disable aspm and clock request before access ephy */
-	RTL_W8(tp, Config2, RTL_R8(tp, Config2) & ~ClkReqEn);
-	RTL_W8(tp, Config5, RTL_R8(tp, Config5) & ~ASPM_en);
+	rtl_hw_aspm_clkreq_enable(tp, false);
 	rtl_ephy_init(tp, e_info_8168ep_3, ARRAY_SIZE(e_info_8168ep_3));
 
 	rtl_hw_start_8168ep(tp);
@@ -5888,6 +5901,8 @@ static void rtl_hw_start_8168ep_3(struct rtl8169_private *tp)
 	data = r8168_mac_ocp_read(tp, 0xe860);
 	data |= 0x0080;
 	r8168_mac_ocp_write(tp, 0xe860, data);
+
+	rtl_hw_aspm_clkreq_enable(tp, true);
 }
 
 static void rtl_hw_start_8168(struct rtl8169_private *tp)
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH v3 07/14] net: smc911x: remove the dmaengine compat need
From: Robert Jarzmik @ 2018-06-21  8:43 UTC (permalink / raw)
  To: David Miller; +Cc: daniel, linux-kernel, netdev
In-Reply-To: <20180621.083945.1673251234112481148.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> From: Robert Jarzmik <robert.jarzmik@free.fr>
> Date: Wed, 20 Jun 2018 19:17:43 +0200
>
>> Could you (or somebody from netdev) review it and either ack it (and I'll take
>> it through the pxa tree), or take it for v4.19 please ?
>
> Acked-by: David S. Miller <davem@davemloft.net>
Thanks David.

Queued to pxa/for-next.

Cheers.

-- 
Robert

^ permalink raw reply

* Re: [PATCH v3 08/14] net: smc91x: remove the dmaengine compat need
From: Robert Jarzmik @ 2018-06-21  8:44 UTC (permalink / raw)
  To: David Miller; +Cc: daniel, linux-kernel, netdev
In-Reply-To: <20180621.084002.1771429251320622962.davem@davemloft.net>

David Miller <davem@davemloft.net> writes:

> Acked-by: David S. Miller <davem@davemloft.net>
Thanks David.

Queued to pxa/for-next.

Cheers.

-- 
Robert

^ permalink raw reply

* Re: [GIT] Networking
From: Ingo Molnar @ 2018-06-21  8:45 UTC (permalink / raw)
  To: David Miller, Alexei Starovoitov, Stephen Rothwell
  Cc: torvalds, akpm, netdev, linux-kernel, Thomas Gleixner
In-Reply-To: <20180620.213540.214945910722071226.davem@davemloft.net>


* David Miller <davem@davemloft.net> wrote:

> 1) Fix crash on bpf_prog_load() errors, from Daniel Borkmann.

> Daniel Borkmann (4):
>       Merge branch 'bpf-misc-fixes'
>       bpf: fix panic in prog load calls cleanup
>       bpf: reject any prog that failed read-only lock
>       bpf, xdp, i40e: fix i40e_build_skb skb reserve and truesize

JFYI, I'm still seeing this BPF build error upstream, on a 32-bit allyesconfig I'm 
getting:

  LD      vmlinux.o
  ld: i386:x86-64 architecture of input file `net/bpfilter/bpfilter_umh.o' is incompatible with i386 output
  Makefile:1010: recipe for target 'vmlinux' failed
  make: *** [vmlinux] Error 1

A similar looking build bug was reported by sfr three weeks ago:

> Subject: linux-next: build failure after merge of the net-next tree
>
> ...
>
> x86_64-linux-ld: unknown architecture of input file `net/bpfilter/bpfilter_umh.o' 
> is incompatible with i386:x86-64 output
>
> Caused by commit
>
>  d2ba09c17a06 ("net: add skeleton of bpfilter kernel module")
>
> In my builds, the host is PowerPC 64 LE ...
>
> I have reverted that commit along with
>
>  61a552eb487f ("bpfilter: fix build dependency")
>  13405468f49d ("bpfilter: don't pass O_CREAT when opening console for debug")
>
> for today.

Is there a fix I could try?

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH v3 09/14] ASoC: pxa: remove the dmaengine compat need
From: Robert Jarzmik @ 2018-06-21  8:50 UTC (permalink / raw)
  To: Mark Brown, Daniel Mack
  Cc: Ulf Hansson, alsa-devel, linux-kernel, linux-ide, linux-mtd,
	Miquel Raynal, Mauro Carvalho Chehab, Vinod Koul,
	Richard Weinberger, Takashi Iwai, Marek Vasut, linux-media,
	Bartlomiej Zolnierkiewicz, Haojian Zhuang, Boris Brezillon,
	linux-arm-kernel, Nicolas Pitre, netdev, linux-mmc, Liam Girdwood,
	dmaengine, Tejun Heo, Brian Norris, David Woodhouse
In-Reply-To: <1984a4a7-e33b-6349-31f3-6f85ff1582b8@zonque.org>

Daniel Mack <daniel@zonque.org> writes:

> Hi Mark,
>
> I prepared a series of patches for 4.18 that conflict with this one. Instead of
> letting other people resolve this down the road, I'd prefer if this one went
> through the ASoC tree, if you don't mind.
>
> I've talked to Robert off-list, and he's fine with this approach.
That's right.

I'll tentativaly queue this patch to my pxa/for-next tree. As soon as I hear
that Mark queued it up in his tree, I'll drop it from mine.

Cheers.

--
Robert

^ permalink raw reply

* Re: BUG: jumbo frames broken after commit xen-netfront: Fix race between device setup and open
From: Juergen Gross @ 2018-06-21  9:05 UTC (permalink / raw)
  To: Ross Lagerwall
  Cc: Javier Martinez Canillas, Andrew Jeddeloh, Laura Abbott,
	dustymabe, Greg Kroah-Hartman, David S. Miller, stable, netdev,
	xen-devel
In-Reply-To: <CABxcv=mKA=Ueu6PmPs5OQgpU_jkG4BhDxNLwEpiiOSV5t8xzrg@mail.gmail.com>

On 14/06/18 17:43, Javier Martinez Canillas wrote:
> Hi Andrew,
> 
> On Wed, Jun 6, 2018 at 6:29 PM, Andrew Jeddeloh
> <andrew.jeddeloh@redhat.com> wrote:
>> Hi all,
>>
>> The patch "xen-netfront: Fix race between device setup and open" seems
>> to have introduced a regression preventing setting MTU's larger than
>> 1500. We experienced this downstream with Container Linux and
>> confirmed with Fedora 28 as well.
>>
>> It's commit f599c64fdf7d9c108e8717fb04bc41c680120da4 in the linux-stable tree.
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/?id=f599c64fdf7d9c108e8717fb04bc41c680120da4
>>
>> Downstream bugs:
>> https://github.com/coreos/bugs/issues/2443
>> https://bugzilla.redhat.com/show_bug.cgi?id=1584216
>>
>> We've confirmed that reverting that commit fixes the bug. It be
>> reliably can be reproduced on AWS with t2.micro instances (and
>> presumably other systems using the same driver). Both using
>> systemd-networkd to set the mtu and manual ip link commands cause the
>> link to repsond with "Invalid argument" when trying to set the MTU >
>> 1500.
>>
>> I'm not sure why that commit introduced the regression.
>>
>> Please let me know if there's any more information that would be helpful.
>>
>> - Andrew
> 
> I'm adding some relevant people to the CC list to bring more attention
> on this regression.
> 
> The get_maintainer.pl script is very useful to get some hints on who
> should be copied, i.e:
> 
> $ ./scripts/get_maintainer.pl -f drivers/net/xen-netfront.c

Ross, have you made any progress here? If not I'm thinking of reverting
your patch as I think the current problem is more severe than the one
your patch did address.


Juergen

^ permalink raw reply

* Re: [PATCH] net: Fix device name resolving crash in default_device_exit()
From: Kirill Tkhai @ 2018-06-21 10:03 UTC (permalink / raw)
  To: David Ahern, netdev
  Cc: davem, daniel, jakub.kicinski, ast, linux, john.fastabend, brouer
In-Reply-To: <6d4ded39-d85d-7497-f0fb-af3df160b18b@gmail.com>

On 20.06.2018 20:15, David Ahern wrote:
> On 6/20/18 2:57 AM, Kirill Tkhai wrote:
>> From: Kirill Tkhai <ktkhai@virtuozzo.com>
>>
>> The following script makes kernel to crash since it can't obtain
>> a name for a device, when the name is occupied by another device:
>>
>> #!/bin/bash
>> ifconfig eth0 down
>> ifconfig eth1 down
>> index=`cat /sys/class/net/eth1/ifindex`
>> ip link set eth1 name dev$index
>> unshare -n sleep 1h &
>> pid=$!
>> while [[ "`readlink /proc/self/ns/net`" == "`readlink /proc/$pid/ns/net`" ]]; do continue; done
>> ip link set dev$index netns $pid
>> ip link set eth0 name dev$index
>> kill -9 $pid
>>
>> Kernel messages:
>>
>> virtio_net virtio1 dev3: renamed from eth1
>> virtio_net virtio0 dev3: renamed from eth0
>> default_device_exit: failed to move dev3 to init_net: -17
>> ------------[ cut here ]------------
>> kernel BUG at net/core/dev.c:8978!
>> invalid opcode: 0000 [#1] PREEMPT SMP
>> CPU: 1 PID: 276 Comm: kworker/u8:3 Not tainted 4.17.0+ #292
>> Workqueue: netns cleanup_net
>> RIP: 0010:default_device_exit+0x9c/0xb0
>> [stack trace snipped]
>>
>> This patch gives more variability during choosing new name
>> of device and fixes the problem.
>>
>> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
>> ---
>>
>> Since there is no suggestions how to fix this in another way, I'm resending the patch.
> 
> This patch does not remove the BUG, so does not really solve the
> problem. ie., it is fairly trivial to write a script (32k dev%d named
> devices in init_net) that triggers it again, so your commit subject and
> commit log are not correct with the references to 'fixing the problem'.

1)I'm not agree with you and I don't think removing the BUG() is a good idea.
This function is called from the place, where it must not fail. But it can
fail, and the problem with name is not the only reason of this happens.
We can't continue further pernet_operations in case of a problem happened
in default_device_exit(), and we can't remove the BUG() before this function
becomes of void type. But we are not going to make it of void type. So
we can't remove the BUG().

2)In case of the script is trivial, can't you just post it here to show
what type of devices you mean? Is there real problem or this is
a theoretical thinking?

All virtual devices I see have rtnl_link_ops, so that they just destroyed
in default_device_exit_batch(). According to physical devices, it's difficult
to imagine a node with 32k physical devices, and if someone tried to deploy
them it may meet problems not only in this place.

> The change does provide more variability in naming and reduces the
> likelihood of not being able to push a device back to init_net.

No, it provides. With the patch one may move real device to a container,
and allow to do with the device anything including changing of device
index. Then, the destruction of the container does not resilt a kernel
panic just because of two devices have the same index.

Kirill

^ permalink raw reply

* Re: [PATCH bpf-next 3/3] bpf: btf: json print map dump with btf info
From: Okash Khawaja @ 2018-06-21 10:05 UTC (permalink / raw)
  To: Song Liu
  Cc: Daniel Borkmann, Martin Lau, Alexei Starovoitov, Yonghong Song,
	Quentin Monnet, Jakub Kicinski, David S. Miller,
	netdev@vger.kernel.org, Kernel Team, linux-kernel@vger.kernel.org
In-Reply-To: <60109415-AE62-4E4A-BF7D-401F66245523@fb.com>

On Thu, Jun 21, 2018 at 12:22:52AM +0100, Song Liu wrote:
> 
> 
> > On Jun 20, 2018, at 1:30 PM, Okash Khawaja <osk@fb.com> wrote:
> > 
> > This patch modifies `bpftool map dump [-j|-p] id <map-id>` to json-
> > print and pretty-json-print map dump. It calls btf_dumper introduced in
> > previous patch to accomplish this.
> > 
> > The patch only prints debug info when -j or -p flags are supplied. Then
> > too, if the map has associated btf data loaded. Otherwise the usual
> > debug-less output is printed.
> > 
> > Signed-off-by: Okash Khawaja <osk@fb.com>
> > Acked-by: Martin KaFai Lau <kafai@fb.com>
> > 
> > ---
> > tools/bpf/bpftool/map.c |   94 ++++++++++++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 91 insertions(+), 3 deletions(-)
> > 
> > --- a/tools/bpf/bpftool/map.c
> > +++ b/tools/bpf/bpftool/map.c
> > @@ -43,9 +43,13 @@
> > #include <unistd.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > +#include <linux/err.h>
> > 
> > #include <bpf.h>
> > 
> > +#include "json_writer.h"
> > +#include "btf.h"
> > +#include "btf_dumper.h"
> > #include "main.h"
> > 
> > static const char * const map_type_name[] = {
> > @@ -508,6 +512,83 @@ static int do_show(int argc, char **argv
> > 	return errno == ENOENT ? 0 : -1;
> > }
> > 
> > +
> > +static int do_dump_btf(struct btf *btf, struct bpf_map_info *map_info,
> > +		void *key, void *value)
> > +{
> > +	int ret;
> > +
> > +	jsonw_start_object(json_wtr);
> > +	jsonw_name(json_wtr, "key");
> > +
> > +	ret = btf_dumper_type(btf, json_wtr, map_info->btf_key_type_id, key);
> > +	if (ret)
> > +		goto out;
> > +
> > +	jsonw_end_object(json_wtr);
> > +
> > +	jsonw_start_object(json_wtr);
> > +	jsonw_name(json_wtr, "value");
> > +
> > +	ret = btf_dumper_type(btf, json_wtr, map_info->btf_value_type_id,
> > +			value);
> > +
> > +out:
> > +	/* end of root object */
> > +	jsonw_end_object(json_wtr);
> > +
> > +	return ret;
> > +}
> 
> Please add some tests for do_dump_btf(), including some invalid data type/kind.
Sure. I was wondering if we can follow this up with tests afterwards.
However if they are essential now, I can add them.

> 
> > +
> > +static struct btf *get_btf(struct bpf_map_info *map_info)
> > +{
> > +	int btf_fd = bpf_btf_get_fd_by_id(map_info->btf_id);
> > +	struct bpf_btf_info btf_info = { 0 };
> > +	__u32 len = sizeof(btf_info);
> > +	uint32_t last_size;
> > +	int err;
> > +	struct btf *btf = NULL;
> > +	void *ptr = NULL, *temp_ptr;
> > +
> > +	if (btf_fd < 0)
> > +		return NULL;
> > +
> > +	btf_info.btf_size = 4096;
> 
> Is btf_size always a constant of 4096? 
We start out with 4096 and in the loop below we increase the size if
needed. It seems like a sane valeu to start with. However if there are
better ideas for start value, we can discuss them.

> 
> > +	do {
> > +		last_size = btf_info.btf_size;
> > +		temp_ptr = realloc(ptr, last_size);
> > +		if (!temp_ptr) {
> > +			p_err("unable allocate memory for debug info.");
> > +			goto exit_free;
> > +		}
> > +
> > +		ptr = temp_ptr;
> > +		bzero(ptr, last_size);
> > +		btf_info.btf = ptr_to_u64(ptr);
> > +		err = bpf_obj_get_info_by_fd(btf_fd, &btf_info, &len);
> > +	} while (!err && btf_info.btf_size > last_size && last_size == 4096);
> > +
> > +	if (err || btf_info.btf_size > last_size) {
> > +		p_info("can't get btf info. debug info won't be displayed. error: %s",
> > +				err ? strerror(errno) : "exceeds size retry");
> > +		goto exit_free;
> > +	}
> > +
> > +	btf = btf__new((uint8_t *) btf_info.btf,
> > +			btf_info.btf_size, NULL);
> > +	if (IS_ERR(btf)) {
> > +		printf("error when initialising btf: %s\n",
> > +				strerror(PTR_ERR(btf)));
> > +		btf = NULL;
> > +	}
> > +
> > +exit_free:
> > +	close(btf_fd);
> > +	free(ptr);
> > +
> > +	return btf;
> > +}
> > +
> > static int do_dump(int argc, char **argv)
> > {
> > 	void *key, *value, *prev_key;
> > @@ -516,6 +597,7 @@ static int do_dump(int argc, char **argv
> > 	__u32 len = sizeof(info);
> > 	int err;
> > 	int fd;
> > +	struct btf *btf = NULL;
> > 
> > 	if (argc != 2)
> > 		usage();
> > @@ -538,6 +620,8 @@ static int do_dump(int argc, char **argv
> > 		goto exit_free;
> > 	}
> > 
> > +	btf = get_btf(&info);
> > +
> > 	prev_key = NULL;
> > 	if (json_output)
> > 		jsonw_start_array(json_wtr);
> > @@ -550,9 +634,12 @@ static int do_dump(int argc, char **argv
> > 		}
> > 
> > 		if (!bpf_map_lookup_elem(fd, key, value)) {
> > -			if (json_output)
> > -				print_entry_json(&info, key, value);
> > -			else
> > +			if (json_output) {
> > +				if (btf)
> > +					do_dump_btf(btf, &info, key, value);
> > +				else
> > +					print_entry_json(&info, key, value);
> > +			} else
> > 				print_entry_plain(&info, key, value);
> > 		} else {
> > 			if (json_output) {
> > @@ -584,6 +671,7 @@ exit_free:
> > 	free(key);
> > 	free(value);
> > 	close(fd);
> > +	btf__free(btf);
> > 
> > 	return err;
> > }
> > 
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox