* Re: [PATCH net-next v2 1/3] net: phy: xgmiitorgmii: Check phy_driver ready before accessing
From: Florian Fainelli @ 2018-06-26 21:59 UTC (permalink / raw)
To: Brandon Maier, netdev
Cc: andrew, davem, michal.simek, clayton.shotwell, kristopher.cory,
linux-kernel
In-Reply-To: <20180626175050.71165-1-brandon.maier@rockwellcollins.com>
On 06/26/2018 10:50 AM, Brandon Maier wrote:
> Since a phy_device is added to the global mdio_bus list during
> phy_device_register(), but a phy_device's phy_driver doesn't get
> attached until phy_probe(). It's possible of_phy_find_device() in
> xgmiitorgmii will return a valid phy with a NULL phy_driver. Leading to
> a NULL pointer access during the memcpy().
>
> Fixes this Oops:
>
> Unable to handle kernel NULL pointer dereference at virtual address 00000000
> pgd = c0004000
> [00000000] *pgd=00000000
> Internal error: Oops: 5 [#1] PREEMPT SMP ARM
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.40 #1
> Hardware name: Xilinx Zynq Platform
> task: ce4c8d00 task.stack: ce4ca000
> PC is at memcpy+0x48/0x330
> LR is at xgmiitorgmii_probe+0x90/0xe8
> pc : [<c074bc68>] lr : [<c0529548>] psr: 20000013
> sp : ce4cbb54 ip : 00000000 fp : ce4cbb8c
> r10: 00000000 r9 : 00000000 r8 : c0c49178
> r7 : 00000000 r6 : cdc14718 r5 : ce762800 r4 : cdc14710
> r3 : 00000000 r2 : 00000054 r1 : 00000000 r0 : cdc14718
> Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
> Control: 18c5387d Table: 0000404a DAC: 00000051
> Process swapper/0 (pid: 1, stack limit = 0xce4ca210)
> ...
> [<c074bc68>] (memcpy) from [<c0529548>] (xgmiitorgmii_probe+0x90/0xe8)
> [<c0529548>] (xgmiitorgmii_probe) from [<c0526a94>] (mdio_probe+0x28/0x34)
> [<c0526a94>] (mdio_probe) from [<c04db98c>] (driver_probe_device+0x254/0x414)
> [<c04db98c>] (driver_probe_device) from [<c04dbd58>] (__device_attach_driver+0xac/0x10c)
> [<c04dbd58>] (__device_attach_driver) from [<c04d96f4>] (bus_for_each_drv+0x84/0xc8)
> [<c04d96f4>] (bus_for_each_drv) from [<c04db5bc>] (__device_attach+0xd0/0x134)
> [<c04db5bc>] (__device_attach) from [<c04dbdd4>] (device_initial_probe+0x1c/0x20)
> [<c04dbdd4>] (device_initial_probe) from [<c04da8fc>] (bus_probe_device+0x98/0xa0)
> [<c04da8fc>] (bus_probe_device) from [<c04d8660>] (device_add+0x43c/0x5d0)
> [<c04d8660>] (device_add) from [<c0526cb8>] (mdio_device_register+0x34/0x80)
> [<c0526cb8>] (mdio_device_register) from [<c0580b48>] (of_mdiobus_register+0x170/0x30c)
> [<c0580b48>] (of_mdiobus_register) from [<c05349c4>] (macb_probe+0x710/0xc00)
> [<c05349c4>] (macb_probe) from [<c04dd700>] (platform_drv_probe+0x44/0x80)
> [<c04dd700>] (platform_drv_probe) from [<c04db98c>] (driver_probe_device+0x254/0x414)
> [<c04db98c>] (driver_probe_device) from [<c04dbc58>] (__driver_attach+0x10c/0x118)
> [<c04dbc58>] (__driver_attach) from [<c04d9600>] (bus_for_each_dev+0x8c/0xd0)
> [<c04d9600>] (bus_for_each_dev) from [<c04db1fc>] (driver_attach+0x2c/0x30)
> [<c04db1fc>] (driver_attach) from [<c04daa98>] (bus_add_driver+0x50/0x260)
> [<c04daa98>] (bus_add_driver) from [<c04dc440>] (driver_register+0x88/0x108)
> [<c04dc440>] (driver_register) from [<c04dd6b4>] (__platform_driver_register+0x50/0x58)
> [<c04dd6b4>] (__platform_driver_register) from [<c0b31248>] (macb_driver_init+0x24/0x28)
> [<c0b31248>] (macb_driver_init) from [<c010203c>] (do_one_initcall+0x60/0x1a4)
> [<c010203c>] (do_one_initcall) from [<c0b00f78>] (kernel_init_freeable+0x15c/0x1f8)
> [<c0b00f78>] (kernel_init_freeable) from [<c0763d10>] (kernel_init+0x18/0x124)
> [<c0763d10>] (kernel_init) from [<c0112d74>] (ret_from_fork+0x14/0x20)
> Code: ba000002 f5d1f03c f5d1f05c f5d1f07c (e8b151f8)
> ---[ end trace 3e4ec21905820a1f ]---
>
> Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
--
Florian
^ permalink raw reply
* Re: [PATCH net-next] net: preserve sock reference when scrubbing the skb.
From: Cong Wang @ 2018-06-26 21:48 UTC (permalink / raw)
To: Eric Dumazet
Cc: Flavio Leitner, Linux Kernel Network Developers, Paolo Abeni,
David Miller, Florian Westphal, NetFilter
In-Reply-To: <48e15faf-f935-0166-e1db-18f7286e7264@gmail.com>
On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 06/25/2018 09:15 PM, Cong Wang wrote:
> > On Mon, Jun 25, 2018 at 8:59 AM Flavio Leitner <fbl@redhat.com> wrote:
> >>
> >> The sock reference is lost when scrubbing the packet and that breaks
> >> TSQ (TCP Small Queues) and XPS (Transmit Packet Steering) causing
> >> performance impacts of about 50% in a single TCP stream when crossing
> >> network namespaces.
> >>
> >> XPS breaks because the queue mapping stored in the socket is not
> >> available, so another random queue might be selected when the stack
> >> needs to transmit something like a TCP ACK, or TCP Retransmissions.
> >> That causes packet re-ordering and/or performance issues.
> >>
> >> TSQ breaks because it orphans the packet while it is still in the
> >> host, so packets are queued contributing to the buffer bloat problem.
> >
> > Why should TSQ in one stack care about buffer bloat in another stack?
> >
> > Actually, I think the current behavior is correct, once the packet leaves
> > its current stack (or netns), it should relief the backpressure on TCP
> > socket in this stack, whether it will be queued in another stack is beyond
> > its concern. This breaks the isolation between networking stacks.
> >
>
> We discussed about this during netconf Cong, nobody was against this planned removal.
I agreed to keep skb->sk, but didn't realize it actually impacts TSQ too.
>
> When a packet is attached to a socket, we should keep the association as much as possible.
As much as possible within one stack, I agree. I still don't understand
why we should keep it across the stack boundary.
>
> Only when a new association needs to be done, skb_orphan() needs to be called.
>
> Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> can evades SO_SNDBUF limits.
Right before leaving the stack is not too soon, it is the latest
actually, for veth case.
^ permalink raw reply
* Re: [PATCH] selftests: bpf: config: add config fragments
From: William Tu @ 2018-06-26 21:27 UTC (permalink / raw)
To: Anders Roxell
Cc: Daniel Borkmann, Alexei Starovoitov, Shuah Khan, Networking,
Linux Kernel Mailing List, open list:KERNEL SELFTEST FRAMEWORK
In-Reply-To: <CADYN=9JAOim7CvoZzWnbHNEF+4yG3Lzvry1rZLqGp+7hQxPdkw@mail.gmail.com>
>> >
>> > --- ::11 ping statistics ---
>> > 5 packets transmitted, 3 packets received, 40% packet loss
>> > round-trip min/avg/max = 0.139/1.857/5.293 ms
>> > + ip netns exec at_ns0 ping -c 3 -w 10 -q 10.1.1.200
>> > PING 10.1.1.200 (10.1.1.200): 56 data bytes
>> >
>> > --- 10.1.1.200 ping statistics ---
>> > 3 packets transmitted, 3 packets received, 0% packet loss
>> > round-trip min/avg/max = 0.214/0.256/0.305 ms
>> > + ping -c 3 -w 10 -q 10.1.1.100
>> > PING 10.1.1.100 (10.1.1.100): 56 data bytes
>> >
>> > --- 10.1.1.100 ping statistics ---
>> > 3 packets transmitted, 3 packets received, 0% packet loss
>> > round-trip min/avg/max = 0.210/0.211/0.213 ms
>> > + check_err 0
>> > + '[' 0 -eq 0 ']'
>> > + ret=0
So looks like the ipv4 over ipv6 gre passes.
>> > + ip netns exec at_ns0 ping6 -c 3 -w 10 -q fc80::200
>> > PING fc80::200 (fc80::200): 56 data bytes
>> >
>> > --- fc80::200 ping statistics ---
>> > 10 packets transmitted, 0 packets received, 100% packet loss
`
but the ipv6 over ipv6 gre fails.
Do you have any firewall rules that block this traffic?
or if possible, the packet might get dropped at function ip6gre_xmit_ipv6
can you print the return value of this function?
>> > + check_err 1
>> > + '[' 0 -eq 0 ']'
>> > + ret=1
>> > + ip -s link show ip6gretap11
>> > 19: ip6gretap11@NONE: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1434 qdisc
>> > pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
>> > link/ether de:d2:0c:53:80:8c brd ff:ff:ff:ff:ff:ff
>> > RX: bytes packets errors dropped overrun mcast
>> > 2096 25 0 0 0 0
>> > TX: bytes packets errors dropped carrier collsns
>> > 5324 36 5 5 0 0
>>
>> So there are 5 errors at TX.
>
> and today when I tried it on next-20180620 I saw 8 errors at TX.
>
>> I couldn't reproduce in my local machine using 4.17-rc6.
>> How do I checkin the "next-20180613" source code?
>
> You can find the source code here [1], and I would look in the latest tag that I
> said that I was able to reproduce it on above.
>
> Cheers,
> Anders
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/
Hi Anders,
I'm still not able to reproduce the issue on next-20180620.
Below is my test.
Testing IP6GRETAP tunnel...
PING ::11(::11) 56 data bytes
--- ::11 ping statistics ---
5 packets transmitted, 3 received, 40% packet loss, time 4048ms
rtt min/avg/max/mdev = 0.040/32.118/96.235/45.337 ms
PING 10.1.1.200 (10.1.1.200) 56(84) bytes of data.
--- 10.1.1.200 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2026ms
rtt min/avg/max/mdev = 0.074/0.099/0.117/0.018 ms
PING 10.1.1.100 (10.1.1.100) 56(84) bytes of data.
--- 10.1.1.100 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2054ms
rtt min/avg/max/mdev = 0.069/0.113/0.187/0.052 ms
PING fc80::200(fc80::200) 56 data bytes
--- fc80::200 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2054ms
rtt min/avg/max/mdev = 0.069/0.104/0.142/0.031 ms
PASS: ip6gretap
root@osboxes:~/linux-next/tools/testing/selftests/bpf#
root@osboxes:~/linux-next/tools/testing/selftests/bpf# uname -a
Linux osboxes 4.18.0-rc1-next-20180620 #1 SMP Tue Jun 26 12:26:00 PDT
2018 x86_64 x86_64 x86_64 GNU/Linux
^ permalink raw reply
* Re: [PATCH v3 net-next 3/4] netdevsim: add ipsec offload testing
From: Jakub Kicinski @ 2018-06-26 21:23 UTC (permalink / raw)
To: Shannon Nelson; +Cc: davem, netdev, anders.roxell, linux-kselftest
In-Reply-To: <1530032875-30482-4-git-send-email-shannon.nelson@oracle.com>
On Tue, 26 Jun 2018 10:07:54 -0700, Shannon Nelson wrote:
> Implement the IPsec/XFRM offload API for testing.
>
> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
> ---
> V2 - addressed formatting comments from Jakub Kicinski
> V3 - a couple more little xmas tree nits
Thank you! :)
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
^ permalink raw reply
* [PATCH bpf-next] selftests/bpf: Test sys_connect BPF hooks with TFO
From: Andrey Ignatov @ 2018-06-26 21:22 UTC (permalink / raw)
To: netdev; +Cc: Andrey Ignatov, ast, daniel, kernel-team
TCP Fast Open is triggered by sys_sendmsg with MSG_FASTOPEN flag for
SOCK_STREAM socket.
Even though it's sys_sendmsg, it eventually calls __inet_stream_connect
the same way sys_connect does for TCP. __inet_stream_connect, in turn,
already has BPF hooks for sys_connect.
That means TFO is already covered by BPF_CGROUP_INET{4,6}_CONNECT and
the only missing piece is selftest. The patch adds selftest for TFO.
Signed-off-by: Andrey Ignatov <rdna@fb.com>
---
tools/testing/selftests/bpf/test_sock_addr.c | 37 ++++++++++++++++----
1 file changed, 31 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/test_sock_addr.c b/tools/testing/selftests/bpf/test_sock_addr.c
index a5e76b9219b9..2e45c92d1111 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -998,8 +998,9 @@ int init_pktinfo(int domain, struct cmsghdr *cmsg)
return 0;
}
-static int sendmsg_to_server(const struct sockaddr_storage *addr,
- socklen_t addr_len, int set_cmsg, int *syscall_err)
+static int sendmsg_to_server(int type, const struct sockaddr_storage *addr,
+ socklen_t addr_len, int set_cmsg, int flags,
+ int *syscall_err)
{
union {
char buf[CMSG_SPACE(sizeof(struct in6_pktinfo))];
@@ -1022,7 +1023,7 @@ static int sendmsg_to_server(const struct sockaddr_storage *addr,
goto err;
}
- fd = socket(domain, SOCK_DGRAM, 0);
+ fd = socket(domain, type, 0);
if (fd == -1) {
log_err("Failed to create client socket");
goto err;
@@ -1052,7 +1053,7 @@ static int sendmsg_to_server(const struct sockaddr_storage *addr,
}
}
- if (sendmsg(fd, &hdr, 0) != sizeof(data)) {
+ if (sendmsg(fd, &hdr, flags) != sizeof(data)) {
log_err("Fail to send message to server");
*syscall_err = errno;
goto err;
@@ -1066,6 +1067,15 @@ static int sendmsg_to_server(const struct sockaddr_storage *addr,
return fd;
}
+static int fastconnect_to_server(const struct sockaddr_storage *addr,
+ socklen_t addr_len)
+{
+ int sendmsg_err;
+
+ return sendmsg_to_server(SOCK_STREAM, addr, addr_len, /*set_cmsg*/0,
+ MSG_FASTOPEN, &sendmsg_err);
+}
+
static int recvmsg_from_client(int sockfd, struct sockaddr_storage *src_addr)
{
struct timeval tv;
@@ -1185,6 +1195,20 @@ static int run_connect_test_case(const struct sock_addr_test *test)
if (cmp_local_ip(clientfd, &expected_src_addr))
goto err;
+ if (test->type == SOCK_STREAM) {
+ /* Test TCP Fast Open scenario */
+ clientfd = fastconnect_to_server(&requested_addr, addr_len);
+ if (clientfd == -1)
+ goto err;
+
+ /* Make sure src and dst addrs were overridden properly */
+ if (cmp_peer_addr(clientfd, &expected_addr))
+ goto err;
+
+ if (cmp_local_ip(clientfd, &expected_src_addr))
+ goto err;
+ }
+
goto out;
err:
err = -1;
@@ -1222,8 +1246,9 @@ static int run_sendmsg_test_case(const struct sock_addr_test *test)
if (clientfd >= 0)
close(clientfd);
- clientfd = sendmsg_to_server(&requested_addr, addr_len,
- set_cmsg, &err);
+ clientfd = sendmsg_to_server(test->type, &requested_addr,
+ addr_len, set_cmsg, /*flags*/0,
+ &err);
if (err)
goto out;
else if (clientfd == -1)
--
2.17.1
^ permalink raw reply related
* [PATCH ipsec-next 1/1] xfrm: don't check offload_handle for nonzero
From: Shannon Nelson @ 2018-06-26 21:19 UTC (permalink / raw)
To: steffen.klassert; +Cc: netdev
The offload_handle should be an opaque data cookie for the driver
to use, much like the data cookie for a timer or alarm callback.
Thus, the XFRM stack should not be checking for non-zero, because
the driver might use that to store an array reference, which could
be zero, or some other zero but meaningful value.
We can remove the checks for non-zero because there are plenty
other attributes also being checked to see if there is an offload
in place for the SA in question.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
---
net/ipv4/esp4_offload.c | 6 ++----
net/ipv6/esp6_offload.c | 6 ++----
net/xfrm/xfrm_device.c | 6 +++---
3 files changed, 7 insertions(+), 11 deletions(-)
diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index bbeecd1..58834a1 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -135,8 +135,7 @@ static struct sk_buff *esp4_gso_segment(struct sk_buff *skb,
skb->encap_hdr_csum = 1;
- if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
- (x->xso.dev != skb->dev))
+ if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev)
esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK);
else if (!(features & NETIF_F_HW_ESP_TX_CSUM))
esp_features = features & ~NETIF_F_CSUM_MASK;
@@ -179,8 +178,7 @@ static int esp_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features_
if (!xo)
return -EINVAL;
- if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
- (x->xso.dev != skb->dev)) {
+ if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev) {
xo->flags |= CRYPTO_FALLBACK;
hw_offload = false;
}
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index ddfa533..6177e21 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -162,8 +162,7 @@ static struct sk_buff *esp6_gso_segment(struct sk_buff *skb,
skb->encap_hdr_csum = 1;
- if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
- (x->xso.dev != skb->dev))
+ if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev)
esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK);
else if (!(features & NETIF_F_HW_ESP_TX_CSUM))
esp_features = features & ~NETIF_F_CSUM_MASK;
@@ -207,8 +206,7 @@ static int esp6_xmit(struct xfrm_state *x, struct sk_buff *skb, netdev_features
if (!xo)
return -EINVAL;
- if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
- (x->xso.dev != skb->dev)) {
+ if (!(features & NETIF_F_HW_ESP) || x->xso.dev != skb->dev) {
xo->flags |= CRYPTO_FALLBACK;
hw_offload = false;
}
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 175941e..9265dd6 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -56,7 +56,7 @@ struct sk_buff *validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t featur
if (skb_is_gso(skb)) {
struct net_device *dev = skb->dev;
- if (unlikely(!x->xso.offload_handle || (x->xso.dev != dev))) {
+ if (unlikely(x->xso.dev != dev)) {
struct sk_buff *segs;
/* Packet got rerouted, fixup features and segment it. */
@@ -210,8 +210,8 @@ bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x)
if (!x->type_offload || x->encap)
return false;
- if ((!dev || (x->xso.offload_handle && (dev == xfrm_dst_path(dst)->dev))) &&
- (!xdst->child->xfrm && x->type->get_mtu)) {
+ if ((!dev || (dev == xfrm_dst_path(dst)->dev)) &&
+ (!xdst->child->xfrm && x->type->get_mtu)) {
mtu = x->type->get_mtu(x, xdst->child_mtu_cached);
if (skb->len <= mtu)
--
2.7.4
^ permalink raw reply related
* Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw
From: Jakub Kicinski @ 2018-06-26 21:18 UTC (permalink / raw)
To: Jiri Pirko
Cc: Linux Netdev List, David Miller, Jamal Hadi Salim, Cong Wang,
Simon Horman, John Hurley, David Ahern, mlxsw
In-Reply-To: <20180626071217.GR2161@nanopsycho>
On Tue, 26 Jun 2018 09:12:17 +0200, Jiri Pirko wrote:
> Tue, Jun 26, 2018 at 09:00:45AM CEST, jakub.kicinski@netronome.com wrote:
> >On Mon, Jun 25, 2018 at 11:43 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> >> Tue, Jun 26, 2018 at 06:58:50AM CEST, jakub.kicinski@netronome.com wrote:
> >>>On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:
> >>>> From: Jiri Pirko <jiri@mellanox.com>
> >>>>
> >>>> For the TC clsact offload these days, some of HW drivers need
> >>>> to hold a magic ball. The reason is, with the first inserted rule inside
> >>>> HW they need to guess what fields will be used for the matching. If
> >>>> later on this guess proves to be wrong and user adds a filter with a
> >>>> different field to match, there's a problem. Mlxsw resolves it now with
> >>>> couple of patterns. Those try to cover as many match fields as possible.
> >>>> This aproach is far from optimal, both performance-wise and scale-wise.
> >>>> Also, there is a combination of filters that in certain order won't
> >>>> succeed.
> >>>>
> >>>> Most of the time, when user inserts filters in chain, he knows right away
> >>>> how the filters are going to look like - what type and option will they
> >>>> have. For example, he knows that he will only insert filters of type
> >>>> flower matching destination IP address. He can specify a template that
> >>>> would cover all the filters in the chain.
> >>>
> >>>Perhaps it's lack of sleep, but this paragraph threw me a little off
> >>>the track. IIUC the goal of this set is to provide a way to inform the
> >>>HW about expected matches before any rule is programmed into the HW.
> >>>Not before any rule is added to a particular chain. One can just use
> >>>the first rule in the chain to make a guess about the chain, but thanks
> >>>to this set user can configure *all* chains before any rules are added.
> >>
> >> The template is per-chain. User can use template for chain x and
> >> not-use it for chain y. Up to him.
> >
> >Makes sense.
> >
> >I can't help but wonder if it'd be better to associate the
> >constraints/rules with chains instead of creating a new "template"
> >object. It seems more natural to create a chain with specific
> >constraints in place than add and delete template of which there can
> >be at most one to a chain... Perhaps that's more about the user space
> >tc command line. Anyway, not a strong objection, just a thought.
>
> Hmm. I don't think it is good idea. User should see the template in a
> "show" command per chain. We would have to have 2 show commands, one to
> list the template objects and one to list templates per chains. It makes
> things more complicated for no good reason. I think that this simple
> chain-lock is easier and serves the purpose.
Hm, I think the dump is fine, what I was thinking about was:
# tc chain add dev dummy0 ingress chain_index 22 \
^^^^^
template proto ip \
^^^^^^^^
flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
instead of:
# tc filter template add dev dummy0 ingress \
^^^^^^^^^^^^^^^
proto ip chain_index 22 \
flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF
And then delete becomes:
# tc chain del dev dummy0 ingress chain_index 22
Error: The chain is not empty.
The fact that template is very much like a filter is sort of an
implementation detail, from user perspective it may be more intuitive
to model template as an attribute of the chain, not a filter object
added to a chain.
But I could well be the only person who feels that way :)
^ permalink raw reply
* Re: [PATCH bpf-next 7/7] nfp: bpf: migrate to advanced reciprocal divide in reciprocal_div.h
From: Jakub Kicinski @ 2018-06-26 20:59 UTC (permalink / raw)
To: Jiong Wang; +Cc: alexei.starovoitov, daniel, oss-drivers, netdev
In-Reply-To: <20180625035421.2991-8-jakub.kicinski@netronome.com>
On Sun, 24 Jun 2018 20:54:21 -0700, Jakub Kicinski wrote:
> + * NOTE: because we are using "reciprocal_value_adv" which doesn't
> + * support dividend with MSB set, so we need to JIT separate NFP
> + * sequence to handle such case. It could be a simple sequence if there
> + * is conditional move, however there isn't for NFP. So, we don't bother
> + * generating compare-if-set-branch sequence by rejecting the program
> + * straight away when the u32 dividend has MSB set. Divide by such a
> + * large constant would be rare in practice. Also, the programmer could
> + * simply rewrite it as "result = divisor >= the_const".
Thinking about this again, can we just use carry bit? The code may end
up shorter than the explanation why we don't support that case :P
immed[c, 0]
alu[--, a, -, b]
alu[c, c, +carry, 0]
Should be equivalent to:
c = a >= b
(Thanks to Edwin for double-checking the carry semantics.)
^ permalink raw reply
* Re: [PATCH bpf-next 2/7] lib: reciprocal_div: implement the improved algorithm on the paper mentioned
From: Jakub Kicinski @ 2018-06-26 20:52 UTC (permalink / raw)
To: Song Liu
Cc: Alexei Starovoitov, Daniel Borkmann, oss-drivers, Networking,
Jiong Wang
In-Reply-To: <CAPhsuW4P5J1ZgUVKGfwWs7vX_7thc5cr1x6mrY8Wx3d2dqEDTw@mail.gmail.com>
On Mon, 25 Jun 2018 23:21:10 -0700, Song Liu wrote:
> > +struct reciprocal_value_adv reciprocal_value_adv(u32 d, u8 prec)
> > +{
> > + struct reciprocal_value_adv R;
> > + u32 l, post_shift;
> > + u64 mhigh, mlow;
> > +
> > + l = fls(d - 1);
> > + post_shift = l;
> > + /* NOTE: mlow/mhigh could overflow u64 when l == 32 which means d has
> > + * MSB set. This case needs to be handled before calling
> > + * "reciprocal_value_adv", please see the comment at
> > + * include/linux/reciprocal_div.h.
> > + */
>
> Shall we handle l == 32 case better? I guess the concern here is extra
> handling may slow down the fast path? If that's the case, we should
> at least add a WARNING on the slow path.
Agreed, I think Jiong is travelling, hence no response. We'll respin.
^ permalink raw reply
* Re: [RFC PATCH v2 net-next 00/12] Handle multiple received packets at each stage
From: Tom Herbert @ 2018-06-26 20:48 UTC (permalink / raw)
To: Edward Cree
Cc: linux-net-drivers, Linux Kernel Network Developers,
David S. Miller
In-Reply-To: <fa3d7e58-e7b6-ad0c-619f-824c25ed0d97@solarflare.com>
On Tue, Jun 26, 2018 at 11:15 AM, Edward Cree <ecree@solarflare.com> wrote:
>
> This patch series adds the capability for the network stack to receive a
> list of packets and process them as a unit, rather than handling each
> packet singly in sequence. This is done by factoring out the existing
> datapath code at each layer and wrapping it in list handling code.
>
> The motivation for this change is twofold:
> * Instruction cache locality. Currently, running the entire network
> stack receive path on a packet involves more code than will fit in the
> lowest-level icache, meaning that when the next packet is handled, the
> code has to be reloaded from more distant caches. By handling packets
> in "row-major order", we ensure that the code at each layer is hot for
> most of the list. (There is a corresponding downside in _data_ cache
> locality, since we are now touching every packet at every layer, but in
> practice there is easily enough room in dcache to hold one cacheline of
> each of the 64 packets in a NAPI poll.)
> * Reduction of indirect calls. Owing to Spectre mitigations, indirect
> function calls are now more expensive than ever; they are also heavily
> used in the network stack's architecture (see [1]). By replacing 64
> indirect calls to the next-layer per-packet function with a single
> indirect call to the next-layer list function, we can save CPU cycles.
>
> Drivers pass an SKB list to the stack at the end of the NAPI poll; this
> gives a natural batch size (the NAPI poll weight) and avoids waiting at
> the software level for further packets to make a larger batch (which
> would add latency). It also means that the batch size is automatically
> tuned by the existing interrupt moderation mechanism.
> The stack then runs each layer of processing over all the packets in the
> list before proceeding to the next layer. Where the 'next layer' (or
> the context in which it must run) differs among the packets, the stack
> splits the list; this 'late demux' means that packets which differ only
> in later headers (e.g. same L2/L3 but different L4) can traverse the
> early part of the stack together.
> Also, where the next layer is not (yet) list-aware, the stack can revert
> to calling the rest of the stack in a loop; this allows gradual/creeping
> listification, with no 'flag day' patch needed to listify everything.
>
> Patches 1-2 simply place received packets on a list during the event
> processing loop on the sfc EF10 architecture, then call the normal stack
> for each packet singly at the end of the NAPI poll. (Analogues of patch
> #2 for other NIC drivers should be fairly straightforward.)
> Patches 3-9 extend the list processing as far as the IP receive handler.
> Patches 10-12 apply the list techniques to Generic XDP, since the bpf_func
> there is an indirect call. In patch #12 we JIT a list_func that performs
> list unwrapping and makes direct calls to the bpf_func.
>
> Patches 1-2 alone give about a 10% improvement in packet rate in the
> baseline test; adding patches 3-9 raises this to around 25%. Patches 10-
> 12, intended to improve Generic XDP performance, have in fact slightly
> worsened it; I am unsure why this is and have included them in this RFC
> in the hopes that someone will spot the reason. If no progress is made I
> will drop them from the series.
>
> Performance measurements were made with NetPerf UDP_STREAM, using 1-byte
> packets and a single core to handle interrupts on the RX side; this was
> in order to measure as simply as possible the packet rate handled by a
> single core. Figures are in Mbit/s; divide by 8 to obtain Mpps. The
> setup was tuned for maximum reproducibility, rather than raw performance.
> Full details and more results (both with and without retpolines) are
> presented in [2].
>
> The baseline test uses four streams, and multiple RXQs all bound to a
> single CPU (the netperf binary is bound to a neighbouring CPU). These
> tests were run with retpolines.
> net-next: 6.60 Mb/s (datum)
> after 9: 8.35 Mb/s (datum + 26.6%)
> after 12: 8.29 Mb/s (datum + 25.6%)
> Note however that these results are not robust; changes in the parameters
> of the test often shrink the gain to single-digit percentages. For
> instance, when using only a single RXQ, only a 4% gain was seen. The
> results also seem to change significantly each time the patch series is
> rebased onto a new net-next; for instance the results in [3] with
> retpolines (slide 9) show only 11.6% gain in the same test as above (the
> post-patch performance is the same but the pre-patch datum is 7.5Mb/s).
>
Very nice! I really like the deliberate progression of functionality
in the patches makes follwing them very readable. I do think that XDP
related patches at the end of the set should be separated out.
I suspects the effects will vary a lot between architectures and
configuration, so I'm not too worried about the variance mentioned in
the performance numbers. For future work, it might also be worth it to
compare to techniques done in VPP.
Tom
>
> I also performed tests with Generic XDP enabled (using a simple map-based
> UDP port drop program with no entries in the map), both with and without
> the eBPF JIT enabled.
> No JIT:
> net-next: 3.52 Mb/s (datum)
> after 9: 4.91 Mb/s (datum + 39.5%)
> after 12: 4.83 Mb/s (datum + 37.3%)
>
> With JIT:
> net-next: 5.23 Mb/s (datum)
> after 9: 6.64 Mb/s (datum + 27.0%)
> after 12: 6.46 Mb/s (datum + 23.6%)
>
> Another test variation was the use of software filtering/firewall rules.
> Adding a single iptables rule (a UDP port drop on a port range not
> matching the test traffic), thus making the netfilter hook have work to
> do, reduced baseline performance but showed a similar delta from the
> patches. Similarly, testing with a set of TC flower filters (kindly
> supplied by Cong Wang) in the single-RXQ setup (that previously gave 4%)
> slowed down the baseline but not the patched performance, giving a 5.7%
> performance delta. These data suggest that the batching approach
> remains effective in the presence of software switching rules.
>
> Changes from v1 (see [3]):
> * Rebased across 2 years' net-next movement (surprisingly straightforward).
> - Added Generic XDP handling to netif_receive_skb_list_internal()
> - Dealt with changes to PFMEMALLOC setting APIs
> * General cleanup of code and comments.
> * Skipped function calls for empty lists at various points in the stack
> (patch #9).
> * Added listified Generic XDP handling (patches 10-12), though it doesn't
> seem to help (see above).
> * Extended testing to cover software firewalls / netfilter etc.
>
> [1] http://vger.kernel.org/netconf2018_files/DavidMiller_netconf2018.pdf
> [2] http://vger.kernel.org/netconf2018_files/EdwardCree_netconf2018.pdf
> [3] http://lists.openwall.net/netdev/2016/04/19/89
>
> Edward Cree (12):
> net: core: trivial netif_receive_skb_list() entry point
> sfc: batch up RX delivery
> net: core: unwrap skb list receive slightly further
> net: core: Another step of skb receive list processing
> net: core: another layer of lists, around PF_MEMALLOC skb handling
> net: core: propagate SKB lists through packet_type lookup
> net: ipv4: listified version of ip_rcv
> net: ipv4: listify ip_rcv_finish
> net: don't bother calling list RX functions on empty lists
> net: listify Generic XDP processing, part 1
> net: listify Generic XDP processing, part 2
> net: listify jited Generic XDP processing on x86_64
>
> arch/x86/net/bpf_jit_comp.c | 164 ++++++++++++++
> drivers/net/ethernet/sfc/efx.c | 12 +
> drivers/net/ethernet/sfc/net_driver.h | 3 +
> drivers/net/ethernet/sfc/rx.c | 7 +-
> include/linux/filter.h | 43 +++-
> include/linux/netdevice.h | 4 +
> include/linux/netfilter.h | 27 +++
> include/linux/skbuff.h | 16 ++
> include/net/ip.h | 2 +
> include/trace/events/net.h | 14 ++
> kernel/bpf/core.c | 38 +++-
> net/core/dev.c | 415 +++++++++++++++++++++++++++++-----
> net/core/filter.c | 10 +-
> net/ipv4/af_inet.c | 1 +
> net/ipv4/ip_input.c | 129 ++++++++++-
> 15 files changed, 810 insertions(+), 75 deletions(-)
>
^ permalink raw reply
* Re: [PATCH rdma-next 00/12] RDMA fixes 2018-06-24
From: Jason Gunthorpe @ 2018-06-26 20:39 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Doug Ledford, RDMA mailing list, Matan Barak, Michael J Ruhl,
Noa Osherovich, Raed Salem, Yishai Hadas, Saeed Mahameed,
linux-netdev
In-Reply-To: <20180626042126.GM17747@mtr-leonro.mtl.com>
On Tue, Jun 26, 2018 at 07:21:26AM +0300, Leon Romanovsky wrote:
> On Mon, Jun 25, 2018 at 03:34:38PM -0600, Jason Gunthorpe wrote:
> > On Sun, Jun 24, 2018 at 11:23:41AM +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@mellanox.com>
> > >
> > > Hi,
> > >
> > > This is bunch of patches trigged by running syzkaller internally.
> > >
> > > I'm sending them based on rdma-next mainly for two reasons:
> > > 1, Most of the patches fix the old issues and it doesn't matter when
> > > they will hit the Linus's tree: now or later in a couple of weeks
> > > during merge window.
> > > 2. They interleave with code cleanup, mlx5-next patches and Michael's
> > > feedback on flow counters series.
> > >
> > > Thanks
> > >
> > > Leon Romanovsky (12):
> > > RDMA/uverbs: Protect from attempts to create flows on unsupported QP
> > > RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow
> >
> > I applied these two to for-rc
> >
> > > RDMA/uverbs: Check existence of create_flow callback
> > > RDMA/verbs: Drop kernel variant of create_flow
> > > RDMA/verbs: Drop kernel variant of destroy_flow
> > > net/mlx5: Rate limit errors in command interface
> > > RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR
> > > RDMA/umem: Don't check for negative return value of dma_map_sg_attrs()
> > > RDMA/uverbs: Remove redundant check
> >
> > These to for-next
>
> Jason,
>
> We would like to see patch "[PATCH mlx5-next 05/12] net/mlx5:
> Rate limit errors in command interface" in out mlx5-next. Is it possible
> at this point to drop it from for-next, so I'll be able to take it into
> mlx5-next?
Okay, you got to this while it was still 'wip', so it is dropped. Add
it to the mlx5-next branch and netdev or rdma can pull it next time
there is some reason to pull the branch..
Jason
^ permalink raw reply
* Re: Suspend of SDIO function devices
From: Daniel Mack @ 2018-06-26 20:34 UTC (permalink / raw)
To: Ulf Hansson
Cc: Chris Ball, linux-mmc@vger.kernel.org,
libertas-dev@lists.infradead.org, linux-wireless,
netdev@vger.kernel.org
In-Reply-To: <CAPDyKFrgcFrah34+eFMsBQYWr03bTSMMUuYw61jYfpffSQrQAg@mail.gmail.com>
Hi Ulf,
Thanks for the prompt reply!
On Monday, June 25, 2018 05:00 PM, Ulf Hansson wrote:
> On 24 June 2018 at 22:46, Daniel Mack <daniel@zonque.org> wrote:
>> Please advise, I'm happy to test approaches and send patches.
>
> From a top level point of view, I think this needs to be changed:
>
> 1)
> In cases when the libertas sdio driver's ->suspend() callback, thinks
> of returning -ENOSYS, it should instead call if_sdio_power_off().
> Depending if if_sdio_power_save() has already been called, this shall
> be skipped.
>
> The important thing here is to disable the SDIO func device and to
> release the SDIO irq.
>
> 2)
> During resume, depending on whether the earlier ->suspend() callback
> invoked if_sdio_power_off(), libertas sdio driver's ->resume()
> callback should call if_sdio_power_on().
>
> This should re-initiate the libertas sdio device and re-program the
> firmware. To complete these actions, the firmware file also needs to
> be fetched, which requires file system accesses also to be resumed.
>
> We also need to wait for the firmware programming to be completed,
> hence also do a "wait_event(card->pwron_waitq, priv->fw_ready);" from
> somewhere.
Great, thanks, that helped a lot! I have something that does the job ob
my board. Will send out a patch shortly.
Best regards,
Daniel
^ permalink raw reply
* Re: [PATCH bpf-next 2/3] bpf: btf: add btf json print functionality
From: Jakub Kicinski @ 2018-06-26 20:31 UTC (permalink / raw)
To: Okash Khawaja
Cc: Martin KaFai Lau, Daniel Borkmann, Alexei Starovoitov,
Yonghong Song, Quentin Monnet, David S. Miller, netdev,
kernel-team, linux-kernel
In-Reply-To: <20180626164820.GA12981@w1t1fb>
On Tue, 26 Jun 2018 17:48:22 +0100, Okash Khawaja wrote:
> On Fri, Jun 22, 2018 at 05:26:39PM -0700, Martin KaFai Lau wrote:
> > On Fri, Jun 22, 2018 at 04:32:00PM -0700, Jakub Kicinski wrote:
> > > On Fri, 22 Jun 2018 15:54:08 -0700, Martin KaFai Lau wrote:
> > > > > > > > > > > > > "value": ["0x02","0x00","0x00","0x00","0x00","0x00","0x00","0x00"
> > > > > > > > > > > > > ],
> > > > > > > > > > > > > "value_struct":{
> > > > > > > > > > > > > "src_ip":2,
> > > > > > > > If for the same map the user changes the "src_ip" to an array of int[4]
> > > > > > > > later (e.g. to support ipv6), it will become "src_ip": [1, 2, 3, 4].
> > > > > > > > Is it breaking backward compat?
> > > > > > > > i.e.
> > > > > > > > struct five_tuples {
> > > > > > > > - int src_ip;
> > > > > > > > + int src_ip[4];
> > > > > > > > /* ... */
> > > > > > > > };
> > > > > > >
> > > > > > > Well, it is breaking backward compat, but it's the program doing it,
> > > > > > > not bpftool :) BTF changes so does the output.
> > > > > > As we see, the key/value's btf-output is inherently not backward compat.
> > > > > > Hence, "-j" and "-p" will stay as is. The whole existing json will
> > > > > > be backward compat instead of only partly backward compat.
> > > > >
> > > > > No. There is a difference between user of a facility changing their
> > > > > input and kernel/libraries providing different output in response to
> > > > > that, and the libraries suddenly changing the output on their own.
> > > > >
> > > > > Your example is like saying if user started using IPv6 addresses
> > > > > instead of IPv4 the netlink attributes in dumps will be different so
> > > > > kernel didn't keep backwards compat. While what you're doing is more
> > > > > equivalent to dropping support for old ioctl interfaces because there
> > > > > is a better mechanism now.
> > > > Sorry, I don't follow this. I don't see netlink suffer json issue like
> > > > the one on "key" and "value".
> > > >
> > > > All I can grasp is, the json should normally be backward compat but now
> > > > we are saying anything added by btf-output is an exception because
> > > > the script parsing it will treat it differently than "key" and "value"
> > >
> > > Backward compatibility means that if I run *the same* program against
> > > different kernels/libraries it continues to work. If someone decides
> > > to upgrade their program to work with IPv6 (which was your example)
> > > obviously there is no way system as a whole will look 1:1 the same.
> > >
> > > > > BTF in JSON is very useful, and will help people who writes simple
> > > > > orchestration/scripts based on bpftool *a* *lot*. I really appreciate
> > > > Can you share what the script will do? I want to understand why
> > > > it cannot directly use the BTF format and the map data.
> > >
> > > Think about a python script which wants to read a counter in a map.
> > > Right now it would have to get the BTF, find out which bytes are the
> > > counter, then convert the bytes into a larger int. With JSON BTF if
> > > just does entry["formatted"]["value"]["counter"].
> > >
> > > Real life example from my test code (conversion of 3 element counter
> > > array):
> > >
> > > def str2int(strtab):
> > > inttab = []
> > > for i in strtab:
> > > inttab.append(int(i, 16))
> > > ba = bytearray(inttab)
> > > if len(strtab) == 4:
> > > fmt = "I"
> > > elif len(strtab) == 8:
> > > fmt = "Q"
> > > else:
> > > raise Exception("String array of len %d can't be unpacked to an int" %
> > > (len(strtab)))
> > > return struct.unpack(fmt, ba)[0]
> > >
> > > def convert(elems, idx):
> > > val = []
> > > for i in range(3):
> > > part = elems[idx]["value"][i * length:(i + 1) * length]
> > > val.append(str2int(part))
> > > return val
> > >
> > > With BTF it would be:
> > >
> > > elems[idx]["formatted"]["value"]
> > >
> > > Which is fairly awesome.
> > Thanks for the example. Agree that with BTF, things are easier in general.
> >
> > btw, what more awesome is,
> > #> bpftool map find id 100 key 1
> > {
> > "counter_x": 1,
> > "counter_y": 10
> > }
> >
> > >
> > > > > this addition to bpftool and will start using it myself as soon as it
> > > > > lands. I'm not sure why the reluctance to slightly change the output
> > > > > format?
> > > > The initial change argument is because the json has to be backward compat.
> > > >
> > > > Then we show that btf-output is inherently not backward compat, so
> > > > printing it in json does not make sense at all.
> > > >
> > > > However, now it is saying part of it does not have to be backward compat.
> > >
> > > Compatibility of "formatted" member is defined as -> fields broken out
> > > according to BTF. So it is backward compatible. The definition of
> > > "value" member is -> an array of unfortunately formatted array of
> > > ugly hex strings :(
> > >
> > > > I am fine putting it under "formatted" for "-j" or "-p" if that is the
> > > > case, other than the double output is still confusing. Lets wait for
> > > > Okash's input.
> > > >
> > > > At the same time, the same output will be used as the default plaintext
> > > > output when BTF is available. Then the plaintext BTF output
> > > > will not be limited by the json restrictions when we want
> > > > to improve human readability later. Apparently, the
> > > > improvements on plaintext will not be always applicable
> > > > to json output.
> > >
>
> hi,
>
> so i guess following is what we want:
>
> 1. a "formatted" object nested inside -p and -j switches for bpf map
> dump. this will be JSON and backward compatible
> 2. an output for humans - which is like the current output. this will
> not be JSON. this won't have to be backward compatible. this output will
> be shown when neither of -j and -p are supplied and btf info is
> available.
>
> i can update the patches to v2 which covers 2 above + all other comments
> on the patchset. later we can follow up with a patch for 1.
Please do both at the same time. I've learnt not to trust people when
they say things like "we can follow up with xyz" :( In my experience it
_always_ backfires.
Implementing both outputs in one series will help you structure your
code to best suit both of the formats up front.
^ permalink raw reply
* Re: [PATCH] dt-bindings: Fix unbalanced quotation marks
From: Rob Herring @ 2018-06-26 20:19 UTC (permalink / raw)
To: Jonathan Neuschäfer
Cc: devicetree, Kukjin Kim, Krzysztof Kozlowski, Mark Rutland,
Linus Walleij, Dmitry Torokhov, Thomas Gleixner, Jason Cooper,
Marc Zyngier, Thierry Reding, Jonathan Hunter, Maxime Coquelin,
Alexandre Torgue, Hauke Mehrtens, Rafał Miłecki,
Ralf Baechle, Paul Burton, James Hogan, Madalin Bucur
In-Reply-To: <20180617143127.11421-1-j.neuschaefer@gmx.net>
On Sun, Jun 17, 2018 at 04:31:18PM +0200, Jonathan Neuschäfer wrote:
> Multiple binding documents have various forms of unbalanced quotation
> marks. Fix them.
>
> Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
> ---
>
> Should I split this patch so that different parts can go through different trees?
No, I applied it. Thanks.
> ---
> .../devicetree/bindings/arm/samsung/samsung-boards.txt | 2 +-
> .../devicetree/bindings/gpio/nintendo,hollywood-gpio.txt | 2 +-
> Documentation/devicetree/bindings/input/touchscreen/hideep.txt | 2 +-
> .../bindings/interrupt-controller/nvidia,tegra20-ictlr.txt | 2 +-
> .../devicetree/bindings/interrupt-controller/st,stm32-exti.txt | 2 +-
> Documentation/devicetree/bindings/mips/brcm/soc.txt | 2 +-
> Documentation/devicetree/bindings/net/fsl-fman.txt | 2 +-
> Documentation/devicetree/bindings/power/power_domain.txt | 2 +-
> Documentation/devicetree/bindings/regulator/tps65090.txt | 2 +-
> Documentation/devicetree/bindings/reset/st,sti-softreset.txt | 2 +-
> Documentation/devicetree/bindings/sound/qcom,apq8016-sbc.txt | 2 +-
> Documentation/devicetree/bindings/sound/qcom,apq8096.txt | 2 +-
> 12 files changed, 12 insertions(+), 12 deletions(-)
^ permalink raw reply
* Re: [PATCH net-next 2/4] net/sched: act_tunnel_key: add extended ack support
From: David Ahern @ 2018-06-26 19:50 UTC (permalink / raw)
To: Jakub Kicinski, davem, jbenc
Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
netdev, Simon Horman, Alexander Aring, Pieter Jansen van Vuuren
In-Reply-To: <20180626185308.3605-3-jakub.kicinski@netronome.com>
On 6/26/18 12:53 PM, Jakub Kicinski wrote:
> From: Simon Horman <simon.horman@netronome.com>
>
> Add extended ack support for the tunnel key action by using NL_SET_ERR_MSG
> during validation of user input.
>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Alexander Aring <aring@mojatatu.com>
> Signed-off-by: Simon Horman <simon.horman@netronome.com>
> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
> net/sched/act_tunnel_key.c | 27 +++++++++++++++++++++------
> 1 file changed, 21 insertions(+), 6 deletions(-)
>
Reviewed-by: David Ahern <dsahern@gmail.com>
^ permalink raw reply
* [PATCH net-next] netlink: Return extack message if attribute validation fails
From: dsahern @ 2018-06-26 19:39 UTC (permalink / raw)
To: netdev; +Cc: jakub.kicinski, David Ahern
From: David Ahern <dsahern@gmail.com>
Have one extack message for parsing and validating.
Signed-off-by: David Ahern <dsahern@gmail.com>
---
lib/nlattr.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/nlattr.c b/lib/nlattr.c
index dfa55c873c13..e335bcafa9e4 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -253,8 +253,8 @@ int nla_parse(struct nlattr **tb, int maxtype, const struct nlattr *head,
if (policy) {
err = validate_nla(nla, maxtype, policy);
if (err < 0) {
- if (extack)
- extack->bad_attr = nla;
+ NL_SET_ERR_MSG_ATTR(extack, nla,
+ "Attribute failed policy validation");
goto errout;
}
}
--
2.11.0
^ permalink raw reply related
* Re: [PATCH net-next v2 3/3] net: phy: xgmiitorgmii: Check read_status results
From: Andrew Lunn @ 2018-06-26 19:38 UTC (permalink / raw)
To: Brandon Maier
Cc: netdev, f.fainelli, davem, michal.simek, clayton.shotwell,
kristopher.cory, linux-kernel
In-Reply-To: <20180626175050.71165-3-brandon.maier@rockwellcollins.com>
On Tue, Jun 26, 2018 at 12:50:50PM -0500, Brandon Maier wrote:
> We're ignoring the result of the attached phy device's read_status().
> Return it so we can detect errors.
>
> Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Andrew
^ permalink raw reply
* Re: [PATCH net-next v2 2/3] net: phy: xgmiitorgmii: Use correct mdio bus
From: Andrew Lunn @ 2018-06-26 19:37 UTC (permalink / raw)
To: Brandon Maier
Cc: netdev, f.fainelli, davem, michal.simek, clayton.shotwell,
kristopher.cory, linux-kernel
In-Reply-To: <20180626175050.71165-2-brandon.maier@rockwellcollins.com>
On Tue, Jun 26, 2018 at 12:50:49PM -0500, Brandon Maier wrote:
> The xgmiitorgmii is using the mii_bus of the device it's attached to,
> instead of the bus it was given during probe.
>
> Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Andrew
^ permalink raw reply
* Re: [PATCH net-next v2 1/3] net: phy: xgmiitorgmii: Check phy_driver ready before accessing
From: Andrew Lunn @ 2018-06-26 19:36 UTC (permalink / raw)
To: Brandon Maier
Cc: netdev, f.fainelli, davem, michal.simek, clayton.shotwell,
kristopher.cory, linux-kernel
In-Reply-To: <20180626175050.71165-1-brandon.maier@rockwellcollins.com>
On Tue, Jun 26, 2018 at 12:50:48PM -0500, Brandon Maier wrote:
> Since a phy_device is added to the global mdio_bus list during
> phy_device_register(), but a phy_device's phy_driver doesn't get
> attached until phy_probe(). It's possible of_phy_find_device() in
> xgmiitorgmii will return a valid phy with a NULL phy_driver. Leading to
> a NULL pointer access during the memcpy().
> Signed-off-by: Brandon Maier <brandon.maier@rockwellcollins.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Andrew
^ permalink raw reply
* Re: [patch net-next RFC 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Andrew Lunn @ 2018-06-26 19:35 UTC (permalink / raw)
To: Vadim Pasternak
Cc: Guenter Roeck, linux-pm@vger.kernel.org, netdev@vger.kernel.org,
rui.zhang@intel.com, edubezval@gmail.com, jiri@resnulli.us
In-Reply-To: <HE1PR0502MB37535AE1B1E7093E198F0A67A2490@HE1PR0502MB3753.eurprd05.prod.outlook.com>
On Tue, Jun 26, 2018 at 07:01:32PM +0000, Vadim Pasternak wrote:
>
>
> > -----Original Message-----
> > From: Andrew Lunn [mailto:andrew@lunn.ch]
> > Sent: Tuesday, June 26, 2018 9:18 PM
> > To: Vadim Pasternak <vadimp@mellanox.com>
> > Cc: Guenter Roeck <linux@roeck-us.net>; linux-pm@vger.kernel.org;
> > netdev@vger.kernel.org; rui.zhang@intel.com; edubezval@gmail.com;
> > jiri@resnulli.us
> > Subject: Re: [patch net-next RFC 03/12] mlxsw: core: Add core environment
> > module for port temperature reading
> >
> > > However, I have some concerns on this matter.
> > > Our hardware provides bulk reading of the modules temperature, means I
> > > can get all inputs by one hardware request, which is important optimization.
> > > Reading each module individually will be resulted in huge overhead and
> > > will require maybe some cashing of temperature inputs.
> >
> > Well, you can cache the SFP calibration values, and the 4 limit values. To get an
> > actually temperature you need to read 2 bytes from the SFP module. I don't see
> > why that would be expensive. You talk to the firmware over PCIe right? So you
> > have lots of bandwidth.
>
> Yes, but FW in its turn will run I2C transaction to read temperature sensor.
So how does that add overhead? It needs to read the same two bytes
independent of if it is getting readings from one sensor, or all
sensors.
> And we also run hwmon and thermal parts of our driver on BMC (Based
> Management Controller) on system equipped with it.
> In such case host CPU performs networking stuff, while BMC system related
> stuff. And in such configuration BMC talks to FW over I2C.
So you have a 20MHz I2C bus between your BMC and the firmware. Lets
assume a relativity dumb protocol. 2 bytes for command to read an sfp
sensor, 3 bytes for a replay. 5 bytes, at 20Mbps allows you to read
500,000 sensors per second. And for environment monitoring, 64 sensors
one per second should be sufficient.
Andrew
^ permalink raw reply
* RE: [patch net-next RFC 03/12] mlxsw: core: Add core environment module for port temperature reading
From: Vadim Pasternak @ 2018-06-26 19:01 UTC (permalink / raw)
To: Andrew Lunn
Cc: Guenter Roeck, linux-pm@vger.kernel.org, netdev@vger.kernel.org,
rui.zhang@intel.com, edubezval@gmail.com, jiri@resnulli.us
In-Reply-To: <20180626181821.GA9800@lunn.ch>
> -----Original Message-----
> From: Andrew Lunn [mailto:andrew@lunn.ch]
> Sent: Tuesday, June 26, 2018 9:18 PM
> To: Vadim Pasternak <vadimp@mellanox.com>
> Cc: Guenter Roeck <linux@roeck-us.net>; linux-pm@vger.kernel.org;
> netdev@vger.kernel.org; rui.zhang@intel.com; edubezval@gmail.com;
> jiri@resnulli.us
> Subject: Re: [patch net-next RFC 03/12] mlxsw: core: Add core environment
> module for port temperature reading
>
> > However, I have some concerns on this matter.
> > Our hardware provides bulk reading of the modules temperature, means I
> > can get all inputs by one hardware request, which is important optimization.
> > Reading each module individually will be resulted in huge overhead and
> > will require maybe some cashing of temperature inputs.
>
> Well, you can cache the SFP calibration values, and the 4 limit values. To get an
> actually temperature you need to read 2 bytes from the SFP module. I don't see
> why that would be expensive. You talk to the firmware over PCIe right? So you
> have lots of bandwidth.
Yes, but FW in its turn will run I2C transaction to read temperature sensor.
And we also run hwmon and thermal parts of our driver on BMC (Based
Management Controller) on system equipped with it.
In such case host CPU performs networking stuff, while BMC system related
stuff. And in such configuration BMC talks to FW over I2C.
So I'll must to cache.
>
> Andrew
^ permalink raw reply
* [PATCH net-next 4/4] net/sched: add tunnel option support to act_tunnel_key
From: Jakub Kicinski @ 2018-06-26 18:53 UTC (permalink / raw)
To: davem, jbenc
Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
netdev, Simon Horman, Pieter Jansen van Vuuren
In-Reply-To: <20180626185308.3605-1-jakub.kicinski@netronome.com>
From: Simon Horman <simon.horman@netronome.com>
Allow setting tunnel options using the act_tunnel_key action.
Options are expressed as class:type:data and multiple options
may be listed using a comma delimiter.
# ip link add name geneve0 type geneve dstport 0 external
# tc qdisc add dev eth0 ingress
# tc filter add dev eth0 protocol ip parent ffff: \
flower indev eth0 \
ip_proto udp \
action tunnel_key \
set src_ip 10.0.99.192 \
dst_ip 10.0.99.193 \
dst_port 6081 \
id 11 \
geneve_opts 0102:80:00800022,0102:80:00800022 \
action mirred egress redirect dev geneve0
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
include/uapi/linux/tc_act/tc_tunnel_key.h | 26 +++
net/sched/act_tunnel_key.c | 214 +++++++++++++++++++++-
2 files changed, 236 insertions(+), 4 deletions(-)
diff --git a/include/uapi/linux/tc_act/tc_tunnel_key.h b/include/uapi/linux/tc_act/tc_tunnel_key.h
index 72bbefe5d1d1..1b7bdd841b98 100644
--- a/include/uapi/linux/tc_act/tc_tunnel_key.h
+++ b/include/uapi/linux/tc_act/tc_tunnel_key.h
@@ -36,9 +36,35 @@ enum {
TCA_TUNNEL_KEY_PAD,
TCA_TUNNEL_KEY_ENC_DST_PORT, /* be16 */
TCA_TUNNEL_KEY_NO_CSUM, /* u8 */
+ TCA_TUNNEL_KEY_ENC_OPTS, /* Nested TCA_TUNNEL_KEY_ENC_OPTS_
+ * attributes
+ */
__TCA_TUNNEL_KEY_MAX,
};
#define TCA_TUNNEL_KEY_MAX (__TCA_TUNNEL_KEY_MAX - 1)
+enum {
+ TCA_TUNNEL_KEY_ENC_OPTS_UNSPEC,
+ TCA_TUNNEL_KEY_ENC_OPTS_GENEVE, /* Nested
+ * TCA_TUNNEL_KEY_ENC_OPTS_
+ * attributes
+ */
+ __TCA_TUNNEL_KEY_ENC_OPTS_MAX,
+};
+
+#define TCA_TUNNEL_KEY_ENC_OPTS_MAX (__TCA_TUNNEL_KEY_ENC_OPTS_MAX - 1)
+
+enum {
+ TCA_TUNNEL_KEY_ENC_OPT_GENEVE_UNSPEC,
+ TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS, /* u16 */
+ TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE, /* u8 */
+ TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA, /* 4 to 128 bytes */
+
+ __TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX,
+};
+
+#define TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX \
+ (__TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX - 1)
+
#endif
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 20e98ed8d498..5eace86ea687 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -13,6 +13,7 @@
#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/rtnetlink.h>
+#include <net/geneve.h>
#include <net/netlink.h>
#include <net/pkt_sched.h>
#include <net/dst.h>
@@ -57,6 +58,135 @@ static int tunnel_key_act(struct sk_buff *skb, const struct tc_action *a,
return action;
}
+static const struct nla_policy
+enc_opts_policy[TCA_TUNNEL_KEY_ENC_OPTS_MAX + 1] = {
+ [TCA_TUNNEL_KEY_ENC_OPTS_GENEVE] = { .type = NLA_NESTED },
+};
+
+static const struct nla_policy
+geneve_opt_policy[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1] = {
+ [TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS] = { .type = NLA_U16 },
+ [TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE] = { .type = NLA_U8 },
+ [TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA] = { .type = NLA_BINARY,
+ .len = 128 },
+};
+
+static int
+tunnel_key_copy_geneve_opt(const struct nlattr *nla, void *dst, int dst_len,
+ struct netlink_ext_ack *extack)
+{
+ struct nlattr *tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1];
+ int err, data_len, opt_len;
+ u8 *data;
+
+ err = nla_parse_nested(tb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX,
+ nla, geneve_opt_policy, extack);
+ if (err < 0)
+ return err;
+
+ if (!tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS] ||
+ !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE] ||
+ !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]) {
+ NL_SET_ERR_MSG(extack, "Missing tunnel key geneve option class, type or data");
+ return -EINVAL;
+ }
+
+ data = nla_data(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
+ data_len = nla_len(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
+ if (data_len < 4) {
+ NL_SET_ERR_MSG(extack, "Tunnel key geneve option data is less than 4 bytes long");
+ return -ERANGE;
+ }
+ if (data_len % 4) {
+ NL_SET_ERR_MSG(extack, "Tunnel key geneve option data is not a multiple of 4 bytes long");
+ return -ERANGE;
+ }
+
+ opt_len = sizeof(struct geneve_opt) + data_len;
+ if (dst) {
+ struct geneve_opt *opt = dst;
+
+ WARN_ON(dst_len < opt_len);
+
+ opt->opt_class =
+ nla_get_u16(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS]);
+ opt->type = nla_get_u8(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE]);
+ opt->length = data_len / 4; /* length is in units of 4 bytes */
+ opt->r1 = 0;
+ opt->r2 = 0;
+ opt->r3 = 0;
+
+ memcpy(opt + 1, data, data_len);
+ }
+
+ return opt_len;
+}
+
+static int tunnel_key_copy_opts(const struct nlattr *nla, u8 *dst,
+ int dst_len, struct netlink_ext_ack *extack)
+{
+ int err, rem, opt_len, len = nla_len(nla), opts_len = 0;
+ const struct nlattr *attr, *head = nla_data(nla);
+
+ err = nla_validate(head, len, TCA_TUNNEL_KEY_ENC_OPTS_MAX,
+ enc_opts_policy, extack);
+ if (err)
+ return err;
+
+ nla_for_each_attr(attr, head, len, rem) {
+ switch (nla_type(attr)) {
+ case TCA_TUNNEL_KEY_ENC_OPTS_GENEVE:
+ opt_len = tunnel_key_copy_geneve_opt(attr, dst,
+ dst_len, extack);
+ if (opt_len < 0)
+ return opt_len;
+ opts_len += opt_len;
+ if (dst) {
+ dst_len -= opt_len;
+ dst += opt_len;
+ }
+ break;
+ }
+ }
+
+ if (!opts_len) {
+ NL_SET_ERR_MSG(extack, "Empty list of tunnel options");
+ return -EINVAL;
+ }
+
+ if (rem > 0) {
+ NL_SET_ERR_MSG(extack, "Trailing data after parsing tunnel key options attributes");
+ return -EINVAL;
+ }
+
+ return opts_len;
+}
+
+static int tunnel_key_get_opts_len(struct nlattr *nla,
+ struct netlink_ext_ack *extack)
+{
+ return tunnel_key_copy_opts(nla, NULL, 0, extack);
+}
+
+static int tunnel_key_opts_set(struct nlattr *nla, struct ip_tunnel_info *info,
+ int opts_len, struct netlink_ext_ack *extack)
+{
+ info->options_len = opts_len;
+ switch (nla_type(nla_data(nla))) {
+ case TCA_TUNNEL_KEY_ENC_OPTS_GENEVE:
+#if IS_ENABLED(CONFIG_INET)
+ info->key.tun_flags |= TUNNEL_GENEVE_OPT;
+ return tunnel_key_copy_opts(nla, ip_tunnel_info_opts(info),
+ opts_len, extack);
+#else
+ return -EAFNOSUPPORT;
+#endif
+ default:
+ NL_SET_ERR_MSG(extack, "Cannot set tunnel options for unknown tunnel type");
+ return -EINVAL;
+ }
+}
+
static const struct nla_policy tunnel_key_policy[TCA_TUNNEL_KEY_MAX + 1] = {
[TCA_TUNNEL_KEY_PARMS] = { .len = sizeof(struct tc_tunnel_key) },
[TCA_TUNNEL_KEY_ENC_IPV4_SRC] = { .type = NLA_U32 },
@@ -66,6 +196,7 @@ static const struct nla_policy tunnel_key_policy[TCA_TUNNEL_KEY_MAX + 1] = {
[TCA_TUNNEL_KEY_ENC_KEY_ID] = { .type = NLA_U32 },
[TCA_TUNNEL_KEY_ENC_DST_PORT] = {.type = NLA_U16},
[TCA_TUNNEL_KEY_NO_CSUM] = { .type = NLA_U8 },
+ [TCA_TUNNEL_KEY_ENC_OPTS] = { .type = NLA_NESTED },
};
static int tunnel_key_init(struct net *net, struct nlattr *nla,
@@ -81,6 +212,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
struct tcf_tunnel_key *t;
bool exists = false;
__be16 dst_port = 0;
+ int opts_len = 0;
__be64 key_id;
__be16 flags;
int ret = 0;
@@ -128,6 +260,15 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
if (tb[TCA_TUNNEL_KEY_ENC_DST_PORT])
dst_port = nla_get_be16(tb[TCA_TUNNEL_KEY_ENC_DST_PORT]);
+ if (tb[TCA_TUNNEL_KEY_ENC_OPTS]) {
+ opts_len = tunnel_key_get_opts_len(tb[TCA_TUNNEL_KEY_ENC_OPTS],
+ extack);
+ if (opts_len < 0) {
+ ret = opts_len;
+ goto err_out;
+ }
+ }
+
if (tb[TCA_TUNNEL_KEY_ENC_IPV4_SRC] &&
tb[TCA_TUNNEL_KEY_ENC_IPV4_DST]) {
__be32 saddr;
@@ -138,7 +279,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
metadata = __ip_tun_set_dst(saddr, daddr, 0, 0,
dst_port, flags,
- key_id, 0);
+ key_id, opts_len);
} else if (tb[TCA_TUNNEL_KEY_ENC_IPV6_SRC] &&
tb[TCA_TUNNEL_KEY_ENC_IPV6_DST]) {
struct in6_addr saddr;
@@ -162,6 +303,14 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
goto err_out;
}
+ if (opts_len) {
+ ret = tunnel_key_opts_set(tb[TCA_TUNNEL_KEY_ENC_OPTS],
+ &metadata->u.tun_info,
+ opts_len, extack);
+ if (ret < 0)
+ goto err_out;
+ }
+
metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
break;
default:
@@ -234,6 +383,61 @@ static void tunnel_key_release(struct tc_action *a)
}
}
+static int tunnel_key_geneve_opts_dump(struct sk_buff *skb,
+ const struct ip_tunnel_info *info)
+{
+ int len = info->options_len;
+ u8 *src = (u8 *)(info + 1);
+ struct nlattr *start;
+
+ start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS_GENEVE);
+ if (!start)
+ return -EMSGSIZE;
+
+ while (len > 0) {
+ struct geneve_opt *opt = (struct geneve_opt *)src;
+
+ if (nla_put_u16(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS,
+ opt->opt_class) ||
+ nla_put_u8(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE,
+ opt->type) ||
+ nla_put(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA,
+ opt->length * 4, opt + 1))
+ return -EMSGSIZE;
+
+ len -= sizeof(struct geneve_opt) + opt->length * 4;
+ src += sizeof(struct geneve_opt) + opt->length * 4;
+ }
+
+ nla_nest_end(skb, start);
+ return 0;
+}
+
+static int tunnel_key_opts_dump(struct sk_buff *skb,
+ const struct ip_tunnel_info *info)
+{
+ struct nlattr *start;
+ int err;
+
+ if (!info->options_len)
+ return 0;
+
+ start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS);
+ if (!start)
+ return -EMSGSIZE;
+
+ if (info->key.tun_flags & TUNNEL_GENEVE_OPT) {
+ err = tunnel_key_geneve_opts_dump(skb, info);
+ if (err)
+ return err;
+ } else {
+ return -EINVAL;
+ }
+
+ nla_nest_end(skb, start);
+ return 0;
+}
+
static int tunnel_key_dump_addresses(struct sk_buff *skb,
const struct ip_tunnel_info *info)
{
@@ -284,8 +488,9 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
goto nla_put_failure;
if (params->tcft_action == TCA_TUNNEL_KEY_ACT_SET) {
- struct ip_tunnel_key *key =
- ¶ms->tcft_enc_metadata->u.tun_info.key;
+ struct ip_tunnel_info *info =
+ ¶ms->tcft_enc_metadata->u.tun_info;
+ struct ip_tunnel_key *key = &info->key;
__be32 key_id = tunnel_id_to_key32(key->tun_id);
if (nla_put_be32(skb, TCA_TUNNEL_KEY_ENC_KEY_ID, key_id) ||
@@ -293,7 +498,8 @@ static int tunnel_key_dump(struct sk_buff *skb, struct tc_action *a,
¶ms->tcft_enc_metadata->u.tun_info) ||
nla_put_be16(skb, TCA_TUNNEL_KEY_ENC_DST_PORT, key->tp_dst) ||
nla_put_u8(skb, TCA_TUNNEL_KEY_NO_CSUM,
- !(key->tun_flags & TUNNEL_CSUM)))
+ !(key->tun_flags & TUNNEL_CSUM)) ||
+ tunnel_key_opts_dump(skb, info))
goto nla_put_failure;
}
--
2.17.1
^ permalink raw reply related
* [PATCH net-next 3/4] net: check tunnel option type in tunnel flags
From: Jakub Kicinski @ 2018-06-26 18:53 UTC (permalink / raw)
To: davem, jbenc
Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
netdev, Pieter Jansen van Vuuren, Jakub Kicinski
In-Reply-To: <20180626185308.3605-1-jakub.kicinski@netronome.com>
From: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Check the tunnel option type stored in tunnel flags when creating options
for tunnels. Thereby ensuring we do not set geneve, vxlan or erspan tunnel
options on interfaces that are not associated with them.
Make sure all users of the infrastructure set correct flags, for the BPF
helper we have to set all bits to keep backward compatibility.
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
drivers/net/geneve.c | 6 ++++--
drivers/net/vxlan.c | 3 ++-
include/net/ip_tunnels.h | 8 ++++++--
net/core/filter.c | 2 +-
net/ipv4/ip_gre.c | 2 ++
net/ipv6/ip6_gre.c | 2 ++
net/openvswitch/flow_netlink.c | 8 ++++++--
7 files changed, 23 insertions(+), 8 deletions(-)
diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 3e94375b9b01..471edd76ff55 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -236,7 +236,8 @@ static void geneve_rx(struct geneve_dev *geneve, struct geneve_sock *gs,
}
/* Update tunnel dst according to Geneve options. */
ip_tunnel_info_opts_set(&tun_dst->u.tun_info,
- gnvh->options, gnvh->opt_len * 4);
+ gnvh->options, gnvh->opt_len * 4,
+ TUNNEL_GENEVE_OPT);
} else {
/* Drop packets w/ critical options,
* since we don't support any...
@@ -675,7 +676,8 @@ static void geneve_build_header(struct genevehdr *geneveh,
geneveh->proto_type = htons(ETH_P_TEB);
geneveh->rsvd2 = 0;
- ip_tunnel_info_opts_get(geneveh->options, info);
+ if (info->key.tun_flags & TUNNEL_GENEVE_OPT)
+ ip_tunnel_info_opts_get(geneveh->options, info);
}
static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb,
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index cc14e0cd5647..7eb30d7c8bd7 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2122,7 +2122,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
vni = tunnel_id_to_key32(info->key.tun_id);
ifindex = 0;
dst_cache = &info->dst_cache;
- if (info->options_len)
+ if (info->options_len &&
+ info->key.tun_flags & TUNNEL_VXLAN_OPT)
md = ip_tunnel_info_opts(info);
ttl = info->key.ttl;
tos = info->key.tos;
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 90ff430f5e9d..b0d022ff6ea1 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -466,10 +466,12 @@ static inline void ip_tunnel_info_opts_get(void *to,
}
static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info,
- const void *from, int len)
+ const void *from, int len,
+ __be16 flags)
{
memcpy(ip_tunnel_info_opts(info), from, len);
info->options_len = len;
+ info->key.tun_flags |= flags;
}
static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate)
@@ -511,9 +513,11 @@ static inline void ip_tunnel_info_opts_get(void *to,
}
static inline void ip_tunnel_info_opts_set(struct ip_tunnel_info *info,
- const void *from, int len)
+ const void *from, int len,
+ __be16 flags)
{
info->options_len = 0;
+ info->key.tun_flags |= flags;
}
#endif /* CONFIG_INET */
diff --git a/net/core/filter.c b/net/core/filter.c
index e7f12e9f598c..dade922678f6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3582,7 +3582,7 @@ BPF_CALL_3(bpf_skb_set_tunnel_opt, struct sk_buff *, skb,
if (unlikely(size > IP_TUNNEL_OPTS_MAX))
return -ENOMEM;
- ip_tunnel_info_opts_set(info, from, size);
+ ip_tunnel_info_opts_set(info, from, size, TUNNEL_OPTIONS_PRESENT);
return 0;
}
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 2d8efeecf619..c8ca5d8f0f75 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -587,6 +587,8 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
goto err_free_skb;
key = &tun_info->key;
+ if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT))
+ goto err_free_rt;
md = ip_tunnel_info_opts(tun_info);
if (!md)
goto err_free_rt;
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index c8cf2fdbb13b..367177786e34 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -990,6 +990,8 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
fl6.flowi6_uid = sock_net_uid(dev_net(dev), NULL);
dsfield = key->tos;
+ if (!(tun_info->key.tun_flags & TUNNEL_ERSPAN_OPT))
+ goto tx_err;
md = ip_tunnel_info_opts(tun_info);
if (!md)
goto tx_err;
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 492ab0c36f7c..26ab964923fd 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2515,8 +2515,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
struct ip_tunnel_info *tun_info;
struct ovs_tunnel_info *ovs_tun;
struct nlattr *a;
- int err = 0, start, opts_type;
+ int err = 0, start, opts_type, dst_opt_type;
+ dst_opt_type = 0;
ovs_match_init(&match, &key, true, NULL);
opts_type = ip_tun_from_nlattr(nla_data(attr), &match, false, log);
if (opts_type < 0)
@@ -2528,10 +2529,13 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
err = validate_geneve_opts(&key);
if (err < 0)
return err;
+ dst_opt_type = TUNNEL_GENEVE_OPT;
break;
case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+ dst_opt_type = TUNNEL_VXLAN_OPT;
break;
case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS:
+ dst_opt_type = TUNNEL_ERSPAN_OPT;
break;
}
}
@@ -2574,7 +2578,7 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
*/
ip_tunnel_info_opts_set(tun_info,
TUN_METADATA_OPTS(&key, key.tun_opts_len),
- key.tun_opts_len);
+ key.tun_opts_len, dst_opt_type);
add_nested_action_end(*sfa, start);
return err;
--
2.17.1
^ permalink raw reply related
* [PATCH net-next 2/4] net/sched: act_tunnel_key: add extended ack support
From: Jakub Kicinski @ 2018-06-26 18:53 UTC (permalink / raw)
To: davem, jbenc
Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
netdev, Simon Horman, David Ahern, Alexander Aring,
Pieter Jansen van Vuuren
In-Reply-To: <20180626185308.3605-1-jakub.kicinski@netronome.com>
From: Simon Horman <simon.horman@netronome.com>
Add extended ack support for the tunnel key action by using NL_SET_ERR_MSG
during validation of user input.
Cc: David Ahern <dsahern@gmail.com>
Cc: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
net/sched/act_tunnel_key.c | 27 +++++++++++++++++++++------
1 file changed, 21 insertions(+), 6 deletions(-)
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 2edd389e7c92..20e98ed8d498 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -86,16 +86,22 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
int ret = 0;
int err;
- if (!nla)
+ if (!nla) {
+ NL_SET_ERR_MSG(extack, "Tunnel requires attributes to be passed");
return -EINVAL;
+ }
err = nla_parse_nested(tb, TCA_TUNNEL_KEY_MAX, nla, tunnel_key_policy,
- NULL);
- if (err < 0)
+ extack);
+ if (err < 0) {
+ NL_SET_ERR_MSG(extack, "Failed to parse nested tunnel key attributes");
return err;
+ }
- if (!tb[TCA_TUNNEL_KEY_PARMS])
+ if (!tb[TCA_TUNNEL_KEY_PARMS]) {
+ NL_SET_ERR_MSG(extack, "Missing tunnel key parameters");
return -EINVAL;
+ }
parm = nla_data(tb[TCA_TUNNEL_KEY_PARMS]);
exists = tcf_idr_check(tn, parm->index, a, bind);
@@ -107,6 +113,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
break;
case TCA_TUNNEL_KEY_ACT_SET:
if (!tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) {
+ NL_SET_ERR_MSG(extack, "Missing tunnel key id");
ret = -EINVAL;
goto err_out;
}
@@ -144,11 +151,13 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
0, flags,
key_id, 0);
} else {
+ NL_SET_ERR_MSG(extack, "Missing either ipv4 or ipv6 src and dst");
ret = -EINVAL;
goto err_out;
}
if (!metadata) {
+ NL_SET_ERR_MSG(extack, "Cannot allocate tunnel metadata dst");
ret = -ENOMEM;
goto err_out;
}
@@ -156,6 +165,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
break;
default:
+ NL_SET_ERR_MSG(extack, "Unknown tunnel key action");
ret = -EINVAL;
goto err_out;
}
@@ -163,14 +173,18 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
if (!exists) {
ret = tcf_idr_create(tn, parm->index, est, a,
&act_tunnel_key_ops, bind, true);
- if (ret)
+ if (ret) {
+ NL_SET_ERR_MSG(extack, "Cannot create TC IDR");
return ret;
+ }
ret = ACT_P_CREATED;
} else {
tcf_idr_release(*a, bind);
- if (!ovr)
+ if (!ovr) {
+ NL_SET_ERR_MSG(extack, "TC IDR already exists");
return -EEXIST;
+ }
}
t = to_tunnel_key(*a);
@@ -180,6 +194,7 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
if (unlikely(!params_new)) {
if (ret == ACT_P_CREATED)
tcf_idr_release(*a, bind);
+ NL_SET_ERR_MSG(extack, "Cannot allocate tunnel key parameters");
return -ENOMEM;
}
--
2.17.1
^ permalink raw reply related
* [PATCH net-next 1/4] net/sched: act_tunnel_key: disambiguate metadata dst error cases
From: Jakub Kicinski @ 2018-06-26 18:53 UTC (permalink / raw)
To: davem, jbenc
Cc: Roopa Prabhu, jiri, jhs, xiyou.wangcong, daniel, oss-drivers,
netdev, Simon Horman, David Ahern, Alexander Aring
In-Reply-To: <20180626185308.3605-1-jakub.kicinski@netronome.com>
From: Simon Horman <simon.horman@netronome.com>
Metadata may be NULL for one of two reasons:
* Missing user input
* Failure to allocate the metadata dst
Disambiguate these case by returning -EINVAL for the former and -ENOMEM
for the latter rather than -EINVAL for both cases.
This is in preparation for using extended ack to provide more information
to users when parsing their input.
Cc: David Ahern <dsa@cumulusnetworks.com>
Cc: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
net/sched/act_tunnel_key.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/net/sched/act_tunnel_key.c b/net/sched/act_tunnel_key.c
index 626dac81a48a..2edd389e7c92 100644
--- a/net/sched/act_tunnel_key.c
+++ b/net/sched/act_tunnel_key.c
@@ -143,10 +143,13 @@ static int tunnel_key_init(struct net *net, struct nlattr *nla,
metadata = __ipv6_tun_set_dst(&saddr, &daddr, 0, 0, dst_port,
0, flags,
key_id, 0);
+ } else {
+ ret = -EINVAL;
+ goto err_out;
}
if (!metadata) {
- ret = -EINVAL;
+ ret = -ENOMEM;
goto err_out;
}
--
2.17.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox