* Re: [PATCH bpf 0/6] tools: bpftool: fix printf()-like functions
From: Jakub Kicinski @ 2019-08-16 17:51 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Quentin Monnet, Alexei Starovoitov, Daniel Borkmann, bpf,
Network Development, oss-drivers
In-Reply-To: <CAADnVQLPg8jEsUbKOxzQc5Q1BKrB=urSWiniGwsJhcm=UM7oKA@mail.gmail.com>
On Fri, 16 Aug 2019 10:11:12 -0700, Alexei Starovoitov wrote:
> On Fri, Aug 16, 2019 at 9:41 AM Quentin Monnet wrote:
> > 2019-08-15 22:08 UTC-0700 ~ Alexei Starovoitov
> > <alexei.starovoitov@gmail.com>
> > > On Thu, Aug 15, 2019 at 7:32 AM Quentin Monnet
> > > <quentin.monnet@netronome.com> wrote:
> > >>
> > >> Hi,
> > >> Because the "__printf()" attributes were used only where the functions are
> > >> implemented, and not in header files, the checks have not been enforced on
> > >> all the calls to printf()-like functions, and a number of errors slipped in
> > >> bpftool over time.
> > >>
> > >> This set cleans up such errors, and then moves the "__printf()" attributes
> > >> to header files, so that the checks are performed at all locations.
> > >
> > > Applied. Thanks
> > >
> >
> > Thanks Alexei!
> >
> > I noticed the set was applied to the bpf-next tree, and not bpf. Just
> > checking if this is intentional?
>
> Yes. I don't see the _fix_ part in there.
Mm.. these are not critical indeed, but patches 1 and 3 do fix a crash.
Perhaps those should had been a series on their own.
We'll recalibrate :)
> Looks like cleanup to me.
> I've also considered to push
> commit d34b044038bf ("tools: bpftool: close prog FD before exit on
> showing a single program")
> to bpf-next as well.
> That fd leak didn't feel that necessary to push to bpf tree
> and risk merge conflicts... but I pushed it to bpf at the end.
^ permalink raw reply
* Re: Unable to create htb tc classes more than 64K
From: Cong Wang @ 2019-08-16 17:45 UTC (permalink / raw)
To: Akshat Kakkar; +Cc: NetFilter, lartc, netdev
In-Reply-To: <CAA5aLPhf1=wzQG0BAonhR3td-RhEmXaczug8n4hzXCzreb+52g@mail.gmail.com>
On Fri, Aug 16, 2019 at 5:49 AM Akshat Kakkar <akshat.1984@gmail.com> wrote:
>
> I want to have around 1 Million htb tc classes.
> The simple structure of htb tc class, allow having only 64K classes at once.
This is probably due the limit of class ID which is 16bit for minor.
> But, it is possible to make it more hierarchical using hierarchy of
> qdisc and classes.
> For this I tried something like this
>
> tc qdisc add dev eno2 root handle 100: htb
> tc class add dev eno2 parent 100: classid 100:1 htb rate 100Mbps
> tc class add dev eno2 parent 100: classid 100:2 htb rate 100Mbps
>
> tc qdisc add dev eno2 parent 100:1 handle 1: htb
> tc class add dev eno2 parent 1: classid 1:10 htb rate 100kbps
> tc class add dev eno2 parent 1: classid 1:20 htb rate 300kbps
>
> tc qdisc add dev eno2 parent 100:2 handle 2: htb
> tc class add dev eno2 parent 2: classid 2:10 htb rate 100kbps
> tc class add dev eno2 parent 2: classid 2:20 htb rate 300kbps
>
> What I want is something like:
> tc filter add dev eno2 parent 100: protocol ip prio 1 handle
> 0x00000001 fw flowid 1:10
> tc filter add dev eno2 parent 100: protocol ip prio 1 handle
> 0x00000002 fw flowid 1:20
> tc filter add dev eno2 parent 100: protocol ip prio 1 handle
> 0x00000003 fw flowid 2:10
> tc filter add dev eno2 parent 100: protocol ip prio 1 handle
> 0x00000004 fw flowid 2:20
>
> But I am unable to shape my traffic by any of 1:10, 1:20, 2:10 or 2:20.
>
> Can you please suggest, where is it going wrong?
> Is it not possible altogether?
The filter could only filter for classes on the same level, you are
trying to filter for the children classes, which doesn't work.
Thanks.
^ permalink raw reply
* Re: [PATCH bpf 0/6] tools: bpftool: fix printf()-like functions
From: Quentin Monnet @ 2019-08-16 17:18 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Network Development,
oss-drivers
In-Reply-To: <CAADnVQLPg8jEsUbKOxzQc5Q1BKrB=urSWiniGwsJhcm=UM7oKA@mail.gmail.com>
2019-08-16 10:11 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Fri, Aug 16, 2019 at 9:41 AM Quentin Monnet
> <quentin.monnet@netronome.com> wrote:
>>
>> 2019-08-15 22:08 UTC-0700 ~ Alexei Starovoitov
>> <alexei.starovoitov@gmail.com>
>>> On Thu, Aug 15, 2019 at 7:32 AM Quentin Monnet
>>> <quentin.monnet@netronome.com> wrote:
>>>>
>>>> Hi,
>>>> Because the "__printf()" attributes were used only where the functions are
>>>> implemented, and not in header files, the checks have not been enforced on
>>>> all the calls to printf()-like functions, and a number of errors slipped in
>>>> bpftool over time.
>>>>
>>>> This set cleans up such errors, and then moves the "__printf()" attributes
>>>> to header files, so that the checks are performed at all locations.
>>>
>>> Applied. Thanks
>>>
>>
>> Thanks Alexei!
>>
>> I noticed the set was applied to the bpf-next tree, and not bpf. Just
>> checking if this is intentional?
>
> Yes. I don't see the _fix_ part in there.
> Looks like cleanup to me.
> I've also considered to push
> commit d34b044038bf ("tools: bpftool: close prog FD before exit on
> showing a single program")
> to bpf-next as well.
> That fd leak didn't feel that necessary to push to bpf tree
> and risk merge conflicts... but I pushed it to bpf at the end.
>
Ok, thanks for explaining. I'll consider submitting this kind of patches
to bpf-next instead in the future.
Quentin
^ permalink raw reply
* Re: [PATCH bpf-next] libbpf: relicense bpf_helpers.h and bpf_endian.h
From: Greg KH @ 2019-08-16 17:15 UTC (permalink / raw)
To: Daniel Borkmann, Jesper Dangaard Brouer
Cc: Andrii Nakryiko, bpf, netdev, andrii.nakryiko, kernel-team,
Michael Holzheu, Naveen N . Rao, David S . Miller,
Michal Rostecki, John Fastabend, Sargun Dhillon
In-Reply-To: <23a87525-acf5-7a7e-b7b6-3c47b9760eeb@iogearbox.net>
On Fri, Aug 16, 2019 at 05:29:27PM +0200, Daniel Borkmann wrote:
> On 8/16/19 2:10 PM, Jesper Dangaard Brouer wrote:
> > On Thu, 15 Aug 2019 22:45:43 -0700
> > Andrii Nakryiko <andriin@fb.com> wrote:
> >
> > > bpf_helpers.h and bpf_endian.h contain useful macros and BPF helper
> > > definitions essential to almost every BPF program. Which makes them
> > > useful not just for selftests. To be able to expose them as part of
> > > libbpf, though, we need them to be dual-licensed as LGPL-2.1 OR
> > > BSD-2-Clause. This patch updates licensing of those two files.
> >
> > I've already ACKed this, and is fine with (LGPL-2.1 OR BSD-2-Clause).
> >
> > I just want to understand, why "BSD-2-Clause" and not "Apache-2.0" ?
> >
> > The original argument was that this needed to be compatible with
> > "Apache-2.0", then why not simply add this in the "OR" ?
>
> It's use is discouraged in the kernel tree, see also LICENSES/dual/Apache-2.0 (below) and
> statement wrt compatibility from https://www.apache.org/licenses/GPL-compatibility.html:
>
> Valid-License-Identifier: Apache-2.0
> SPDX-URL: https://spdx.org/licenses/Apache-2.0.html
> Usage-Guide:
> Do NOT use. The Apache-2.0 is not GPL2 compatible. [...]
That is correct, don't use Apache-2 code in the kernel please. Even as
a dual-license, it's a total mess.
Having this be BSD-2 is actually better, as it should be fine to use
with Apache 2 code, right?
Jesper, do you know of any license that BSD-2 is not compatible with
that is needed?
thanks,
greg k-h
^ permalink raw reply
* Re: [bpf-next,v2] selftests/bpf: fix race in test_tcp_rtt test
From: Stanislav Fomichev @ 2019-08-16 17:14 UTC (permalink / raw)
To: Petar Penkov; +Cc: netdev, bpf, davem, ast, daniel, sdf, Petar Penkov
In-Reply-To: <20190816170825.22500-1-ppenkov.kernel@gmail.com>
On 08/16, Petar Penkov wrote:
> From: Petar Penkov <ppenkov@google.com>
>
> There is a race in this test between receiving the ACK for the
> single-byte packet sent in the test, and reading the values from the
> map.
>
> This patch fixes this by having the client wait until there are no more
> unacknowledged packets.
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Thanks!
>
> Before:
> for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \
> done | grep -c PASSED
> < trimmed error messages >
> 993
>
> After:
> for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \
> done | grep -c PASSED
> 10000
>
> Fixes: b55873984dab ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> ---
> tools/testing/selftests/bpf/test_tcp_rtt.c | 31 ++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/test_tcp_rtt.c b/tools/testing/selftests/bpf/test_tcp_rtt.c
> index 90c3862f74a8..93916a69823e 100644
> --- a/tools/testing/selftests/bpf/test_tcp_rtt.c
> +++ b/tools/testing/selftests/bpf/test_tcp_rtt.c
> @@ -6,6 +6,7 @@
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <netinet/in.h>
> +#include <netinet/tcp.h>
> #include <pthread.h>
>
> #include <linux/filter.h>
> @@ -34,6 +35,30 @@ static void send_byte(int fd)
> error(1, errno, "Failed to send single byte");
> }
>
> +static int wait_for_ack(int fd, int retries)
> +{
> + struct tcp_info info;
> + socklen_t optlen;
> + int i, err;
> +
> + for (i = 0; i < retries; i++) {
> + optlen = sizeof(info);
> + err = getsockopt(fd, SOL_TCP, TCP_INFO, &info, &optlen);
> + if (err < 0) {
> + log_err("Failed to lookup TCP stats");
> + return err;
> + }
> +
> + if (info.tcpi_unacked == 0)
> + return 0;
> +
> + usleep(10);
> + }
> +
> + log_err("Did not receive ACK");
> + return -1;
> +}
> +
> static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 invoked,
> __u32 dsack_dups, __u32 delivered, __u32 delivered_ce,
> __u32 icsk_retransmits)
> @@ -149,6 +174,11 @@ static int run_test(int cgroup_fd, int server_fd)
> /*icsk_retransmits=*/0);
>
> send_byte(client_fd);
> + if (wait_for_ack(client_fd, 100) < 0) {
> + err = -1;
> + goto close_client_fd;
> + }
> +
>
> err += verify_sk(map_fd, client_fd, "first payload byte",
> /*invoked=*/2,
> @@ -157,6 +187,7 @@ static int run_test(int cgroup_fd, int server_fd)
> /*delivered_ce=*/0,
> /*icsk_retransmits=*/0);
>
> +close_client_fd:
> close(client_fd);
>
> close_bpf_object:
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>
^ permalink raw reply
* Re: [PATCH bpf 0/6] tools: bpftool: fix printf()-like functions
From: Alexei Starovoitov @ 2019-08-16 17:11 UTC (permalink / raw)
To: Quentin Monnet
Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Network Development,
oss-drivers
In-Reply-To: <10602447-213f-fce5-54c7-7952eb3e8712@netronome.com>
On Fri, Aug 16, 2019 at 9:41 AM Quentin Monnet
<quentin.monnet@netronome.com> wrote:
>
> 2019-08-15 22:08 UTC-0700 ~ Alexei Starovoitov
> <alexei.starovoitov@gmail.com>
> > On Thu, Aug 15, 2019 at 7:32 AM Quentin Monnet
> > <quentin.monnet@netronome.com> wrote:
> >>
> >> Hi,
> >> Because the "__printf()" attributes were used only where the functions are
> >> implemented, and not in header files, the checks have not been enforced on
> >> all the calls to printf()-like functions, and a number of errors slipped in
> >> bpftool over time.
> >>
> >> This set cleans up such errors, and then moves the "__printf()" attributes
> >> to header files, so that the checks are performed at all locations.
> >
> > Applied. Thanks
> >
>
> Thanks Alexei!
>
> I noticed the set was applied to the bpf-next tree, and not bpf. Just
> checking if this is intentional?
Yes. I don't see the _fix_ part in there.
Looks like cleanup to me.
I've also considered to push
commit d34b044038bf ("tools: bpftool: close prog FD before exit on
showing a single program")
to bpf-next as well.
That fd leak didn't feel that necessary to push to bpf tree
and risk merge conflicts... but I pushed it to bpf at the end.
^ permalink raw reply
* [bpf-next,v2] selftests/bpf: fix race in test_tcp_rtt test
From: Petar Penkov @ 2019-08-16 17:08 UTC (permalink / raw)
To: netdev, bpf; +Cc: davem, ast, daniel, sdf, Petar Penkov
From: Petar Penkov <ppenkov@google.com>
There is a race in this test between receiving the ACK for the
single-byte packet sent in the test, and reading the values from the
map.
This patch fixes this by having the client wait until there are no more
unacknowledged packets.
Before:
for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \
done | grep -c PASSED
< trimmed error messages >
993
After:
for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \
done | grep -c PASSED
10000
Fixes: b55873984dab ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
Signed-off-by: Petar Penkov <ppenkov@google.com>
---
tools/testing/selftests/bpf/test_tcp_rtt.c | 31 ++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_tcp_rtt.c b/tools/testing/selftests/bpf/test_tcp_rtt.c
index 90c3862f74a8..93916a69823e 100644
--- a/tools/testing/selftests/bpf/test_tcp_rtt.c
+++ b/tools/testing/selftests/bpf/test_tcp_rtt.c
@@ -6,6 +6,7 @@
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
+#include <netinet/tcp.h>
#include <pthread.h>
#include <linux/filter.h>
@@ -34,6 +35,30 @@ static void send_byte(int fd)
error(1, errno, "Failed to send single byte");
}
+static int wait_for_ack(int fd, int retries)
+{
+ struct tcp_info info;
+ socklen_t optlen;
+ int i, err;
+
+ for (i = 0; i < retries; i++) {
+ optlen = sizeof(info);
+ err = getsockopt(fd, SOL_TCP, TCP_INFO, &info, &optlen);
+ if (err < 0) {
+ log_err("Failed to lookup TCP stats");
+ return err;
+ }
+
+ if (info.tcpi_unacked == 0)
+ return 0;
+
+ usleep(10);
+ }
+
+ log_err("Did not receive ACK");
+ return -1;
+}
+
static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 invoked,
__u32 dsack_dups, __u32 delivered, __u32 delivered_ce,
__u32 icsk_retransmits)
@@ -149,6 +174,11 @@ static int run_test(int cgroup_fd, int server_fd)
/*icsk_retransmits=*/0);
send_byte(client_fd);
+ if (wait_for_ack(client_fd, 100) < 0) {
+ err = -1;
+ goto close_client_fd;
+ }
+
err += verify_sk(map_fd, client_fd, "first payload byte",
/*invoked=*/2,
@@ -157,6 +187,7 @@ static int run_test(int cgroup_fd, int server_fd)
/*delivered_ce=*/0,
/*icsk_retransmits=*/0);
+close_client_fd:
close(client_fd);
close_bpf_object:
--
2.23.0.rc1.153.gdeed80330f-goog
^ permalink raw reply related
* Re: [PATCH RFC net-next 3/3] net: dsa: mv88e6xxx: setup SERDES irq also for CPU/DSA ports
From: Marek Behun @ 2019-08-16 17:05 UTC (permalink / raw)
To: Vivien Didelot; +Cc: netdev, Andrew Lunn, Vladimir Oltean, Florian Fainelli
In-Reply-To: <20190816122552.GC629@t480s.localdomain>
On Fri, 16 Aug 2019 12:25:52 -0400
Vivien Didelot <vivien.didelot@gmail.com> wrote:
> So now we have mv88e6xxx_setup_port() and mv88e6xxx_port_setup(), which both
> setup a port, differently, at different time. This is definitely error prone.
Hmm. I don't know how much of mv88e6xxx_setup_port() could be moved to
this new port_setup(), since there are other setup functions called in
mv88e6xxx_setup() that can possibly depend on what was done by
mv88e6xxx_setup_port().
Maybe the new DSA operations should be called .after_setup()
and .before_teardown(), and be called just once for the whole switch,
not for each port?
^ permalink raw reply
* Re: [PATCH net-next,v4 08/12] drivers: net: use flow block API
From: Edward Cree @ 2019-08-16 17:00 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: netdev, netfilter-devel
In-Reply-To: <20190816010421.if6mbyl2n3fsujy4@salvia>
On 16/08/2019 02:04, Pablo Neira Ayuso wrote:
> On Wed, Aug 14, 2019 at 05:17:20PM +0100, Edward Cree wrote:
>> TBH I'm still not clear why you need a flow_block per subsystem, rather than
>> just having multiple subsystems feed their offload requests through the same
>> flow_block but with different enum tc_setup_type or enum tc_fl_command or
>> some other indication that this is "netfilter" rather than "tc" asking for a
>> tc_cls_flower_offload.
> In tc, the flow_block is set up by when the ingress qdisc is
> registered. The usual scenario for most drivers is to have one single
> flow_block per registered ingress qdisc, this makes a 1:1 mapping
> between ingress qdisc and flow_block.
>
> Still, you can register two or more ingress qdiscs to make them share
> the same policy via 'tc block'. In that case all those qdiscs use one
> single flow_block. This makes a N:1 mapping between these qdisc
> ingress and the flow_block. This policy applies to all ingress qdiscs
> that are part of the same tc block. By 'tc block', I'm refering to the
> tcf_block structure.
>
> In netfilter, there are ingress basechains that are registered per
> device. Each basechain gets a flow_block by when the basechain is
> registered. Shared blocks as in tcf_block are not yet supported, but
> it should not be hard to extend it to make this work.
>
> To reuse the same flow_block as entry point for all subsystems as your
> propose - assuming offloads for two or more subsystems are in place -
> then all of them would need to have the same block sharing
> configuration, which might not be the case, ie. tc ingress might have
> a eth0 and eth1 use the same policy via flow_block, while netfilter
> might have one basechain for eth0 and another for eth1 (no policy
> sharing).
Thank you, that's very helpful.
>> This really needs a design document explaining what all the bits are, how
>> they fit together, and why they need to be like that.
> I did not design this flow_block abstraction, this concept was already
> in place under a different name and extend it so the ethtool/netfilter
> subsystems to avoid driver code duplication for offloads.
It's more the new implementation that you've created as part of this
extension that I was asking about, although I agree that the
abstraction that already existed is in need of documentation too.
^ permalink raw reply
* Re: regression in ath10k dma allocation
From: Christoph Hellwig @ 2019-08-16 16:43 UTC (permalink / raw)
To: Tobias Klausmann
Cc: kvalo, davem, ath10k, linux-wireless, netdev, linux-kernel,
nicoleotsuka, hch, m.szyprowski, robin.murphy, iommu,
tobias.klausmann
In-Reply-To: <8fe8b415-2d34-0a14-170b-dcb31c162e67@mni.thm.de>
Hi Tobias,
do you have CONFIG_DMA_CMA set in your config? If not please make sure
you have this commit in your testing tree, and if the problem still
persists it would be a little odd and we'd have to dig deeper:
commit dd3dcede9fa0a0b661ac1f24843f4a1b1317fdb6
Author: Nicolin Chen <nicoleotsuka@gmail.com>
Date: Wed May 29 17:54:25 2019 -0700
dma-contiguous: fix !CONFIG_DMA_CMA version of dma_{alloc, free}_contiguous()
^ permalink raw reply
* Re: [PATCH bpf 0/6] tools: bpftool: fix printf()-like functions
From: Quentin Monnet @ 2019-08-16 16:41 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Network Development,
oss-drivers
In-Reply-To: <CAADnVQKpPaZ3wJJwSn=JPML9pWzwy_8G9c0H=ToaaxZEJ8isnQ@mail.gmail.com>
2019-08-15 22:08 UTC-0700 ~ Alexei Starovoitov
<alexei.starovoitov@gmail.com>
> On Thu, Aug 15, 2019 at 7:32 AM Quentin Monnet
> <quentin.monnet@netronome.com> wrote:
>>
>> Hi,
>> Because the "__printf()" attributes were used only where the functions are
>> implemented, and not in header files, the checks have not been enforced on
>> all the calls to printf()-like functions, and a number of errors slipped in
>> bpftool over time.
>>
>> This set cleans up such errors, and then moves the "__printf()" attributes
>> to header files, so that the checks are performed at all locations.
>
> Applied. Thanks
>
Thanks Alexei!
I noticed the set was applied to the bpf-next tree, and not bpf. Just
checking if this is intentional?
Regards,
Quentin
^ permalink raw reply
* Re: [bpf-next] selftests/bpf: fix race in test_tcp_rtt test
From: Petar Penkov @ 2019-08-16 16:37 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Networking, bpf, davem, Alexei Starovoitov, Daniel Borkmann,
Stanislav Fomichev, Petar Penkov
In-Reply-To: <20190816161302.GQ2820@mini-arch>
On Fri, Aug 16, 2019 at 9:13 AM Stanislav Fomichev <sdf@fomichev.me> wrote:
>
> On 08/16, Petar Penkov wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > There is a race in this test between receiving the ACK for the
> > single-byte packet sent in the test, and reading the values from the
> > map.
> >
> > This patch fixes this by having the client wait until there are no more
> > unacknowledged packets.
> >
> > Before:
> > for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \
> > done | grep -c PASSED
> > < trimmed error messages >
> > 993
> >
> > After:
> > for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \
> > done | grep -c PASSED
> > 10000
> >
> > Fixes: b55873984dab ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > ---
> > tools/testing/selftests/bpf/test_tcp_rtt.c | 31 ++++++++++++++++++++++
> > 1 file changed, 31 insertions(+)
> >
> > diff --git a/tools/testing/selftests/bpf/test_tcp_rtt.c b/tools/testing/selftests/bpf/test_tcp_rtt.c
> > index 90c3862f74a8..2b4754473956 100644
> > --- a/tools/testing/selftests/bpf/test_tcp_rtt.c
> > +++ b/tools/testing/selftests/bpf/test_tcp_rtt.c
> > @@ -6,6 +6,7 @@
> > #include <sys/types.h>
> > #include <sys/socket.h>
> > #include <netinet/in.h>
> > +#include <netinet/tcp.h>
> > #include <pthread.h>
> >
> > #include <linux/filter.h>
> > @@ -34,6 +35,30 @@ static void send_byte(int fd)
> > error(1, errno, "Failed to send single byte");
> > }
> >
> > +static int wait_for_ack(int fd, int retries)
> > +{
> > + struct tcp_info info;
> > + socklen_t optlen;
> > + int i, err;
> > +
> > + for (i = 0; i < retries; i++) {
> > + optlen = sizeof(info);
> > + err = getsockopt(fd, SOL_TCP, TCP_INFO, &info, &optlen);
> > + if (err < 0) {
> > + log_err("Failed to lookup TCP stats");
> > + return err;
> > + }
> > +
> > + if (info.tcpi_unacked == 0)
> > + return 0;
> > +
> > + sleep(1);
> Isn't it too big of a hammer? Maybe usleep(10) here and do x100 retries
> instead?
>
I guess this is more consistent with the time we expect to wait for an
ACK in the test, will change and send out a v2 shortly. Thank you for
your quick review!
> > + }
> > +
> > + log_err("Did not receive ACK");
> > + return -1;
> > +}
> > +
> > static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 invoked,
> > __u32 dsack_dups, __u32 delivered, __u32 delivered_ce,
> > __u32 icsk_retransmits)
> > @@ -149,6 +174,11 @@ static int run_test(int cgroup_fd, int server_fd)
> > /*icsk_retransmits=*/0);
> >
> > send_byte(client_fd);
> > + if (wait_for_ack(client_fd, 5) < 0) {
> > + err = -1;
> > + goto close_client_fd;
> > + }
> > +
> >
> > err += verify_sk(map_fd, client_fd, "first payload byte",
> > /*invoked=*/2,
> > @@ -157,6 +187,7 @@ static int run_test(int cgroup_fd, int server_fd)
> > /*delivered_ce=*/0,
> > /*icsk_retransmits=*/0);
> >
> > +close_client_fd:
> > close(client_fd);
> >
> > close_bpf_object:
> > --
> > 2.23.0.rc1.153.gdeed80330f-goog
> >
^ permalink raw reply
* [PATCH net-next 3/3] net: fec: add support for PTP system timestamping for MDIO devices
From: Hubert Feurstein @ 2019-08-16 16:31 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Fugang Duan,
Vladimir Oltean, David S. Miller
In-Reply-To: <20190816163157.25314-1-h.feurstein@gmail.com>
In order to improve the synchronisation precision of phc2sys (from
the linuxptp project) for devices like switches which are attached
to the MDIO bus, it is necessary the get the system timestamps as
close as possible to the access which causes the PTP timestamp
register to be snapshotted in the switch hardware. Usually this is
triggered by an MDIO write access, the snapshotted timestamp is then
transferred by several MDIO reads.
The ptp_read_system_*ts functions already check the ptp_sts pointer.
Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
drivers/net/ethernet/freescale/fec_main.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 2f6057e7335d..60e866631b61 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1815,10 +1815,12 @@ static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
reinit_completion(&fep->mdio_done);
/* start a write op */
+ ptp_read_system_prets(bus->ptp_sts);
writel(FEC_MMFR_ST | FEC_MMFR_OP_WRITE |
FEC_MMFR_PA(mii_id) | FEC_MMFR_RA(regnum) |
FEC_MMFR_TA | FEC_MMFR_DATA(value),
fep->hwp + FEC_MII_DATA);
+ ptp_read_system_postts(bus->ptp_sts);
/* wait for end of transfer */
time_left = wait_for_completion_timeout(&fep->mdio_done,
@@ -2034,6 +2036,7 @@ static int fec_enet_mii_init(struct platform_device *pdev)
pdev->name, fep->dev_id + 1);
fep->mii_bus->priv = fep;
fep->mii_bus->parent = &pdev->dev;
+ fep->mii_bus->ptp_sts_supported = true;
node = of_get_child_by_name(pdev->dev.of_node, "mdio");
err = of_mdiobus_register(fep->mii_bus, node);
--
2.22.1
^ permalink raw reply related
* [PATCH net-next 1/3] net: mdio: add support for passing a PTP system timestamp to the mii_bus driver
From: Hubert Feurstein @ 2019-08-16 16:31 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Florian Fainelli,
Heiner Kallweit, Vladimir Oltean, David S. Miller
In-Reply-To: <20190816163157.25314-1-h.feurstein@gmail.com>
In order to improve the synchronisation precision of phc2sys (from
the linuxptp project) for devices like switches which are attached
to the MDIO bus, it is necessary the get the system timestamps as
close as possible to the access which causes the PTP timestamp
register to be snapshotted in the switch hardware. Usually this is
triggered by an MDIO write access, the snapshotted timestamp is then
transferred by several MDIO reads.
This patch adds the required infrastructure to solve the problem described
above.
Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
drivers/net/phy/mdio_bus.c | 105 +++++++++++++++++++++++++++++++++++++
include/linux/mdio.h | 7 +++
include/linux/phy.h | 25 +++++++++
3 files changed, 137 insertions(+)
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index bd04fe762056..167a21f267fa 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -34,6 +34,7 @@
#include <linux/phy.h>
#include <linux/io.h>
#include <linux/uaccess.h>
+#include <linux/ptp_clock_kernel.h>
#define CREATE_TRACE_POINTS
#include <trace/events/mdio.h>
@@ -697,6 +698,110 @@ int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val)
}
EXPORT_SYMBOL(mdiobus_write);
+/**
+ * __mdiobus_write_sts - Unlocked version of the mdiobus_write_sts function
+ * @bus: the mii_bus struct
+ * @addr: the phy address
+ * @regnum: register number to write
+ * @val: value to write to @regnum
+ * @sts: the ptp system timestamp
+ *
+ * Write a MDIO bus register and request the MDIO bus driver to take the
+ * system timestamps when sts-pointer is valid. When the bus driver doesn't
+ * support this, the timestamps are taken in this function instead.
+ *
+ * In order to improve the synchronisation precision of phc2sys (from
+ * the linuxptp project) for devices like switches which are attached
+ * to the MDIO bus, it is necessary the get the system timestamps as
+ * close as possible to the access which causes the PTP timestamp
+ * register to be snapshotted in the switch hardware. Usually this is
+ * triggered by an MDIO write access, the snapshotted timestamp is then
+ * transferred by several MDIO reads.
+ *
+ * Caller must hold the mdio bus lock.
+ *
+ * NOTE: MUST NOT be called from interrupt context.
+ */
+int __mdiobus_write_sts(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+ struct ptp_system_timestamp *sts)
+{
+ int retval;
+
+ WARN_ON_ONCE(!mutex_is_locked(&bus->mdio_lock));
+
+ if (!bus->ptp_sts_supported)
+ ptp_read_system_prets(sts);
+
+ bus->ptp_sts = sts;
+ retval = __mdiobus_write(bus, addr, regnum, val);
+ bus->ptp_sts = NULL;
+
+ if (!bus->ptp_sts_supported)
+ ptp_read_system_postts(sts);
+
+ return retval;
+}
+EXPORT_SYMBOL(__mdiobus_write_sts);
+
+/**
+ * mdiobus_write_sts - Convenience function for writing a given MII mgmt
+ * register
+ *
+ * @bus: the mii_bus struct
+ * @addr: the phy address
+ * @regnum: register number to write
+ * @val: value to write to @regnum
+ * @sts: the ptp system timestamp
+ *
+ * NOTE: MUST NOT be called from interrupt context,
+ * because the bus read/write functions may wait for an interrupt
+ * to conclude the operation.
+ */
+int mdiobus_write_sts(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+ struct ptp_system_timestamp *sts)
+{
+ int retval;
+
+ BUG_ON(in_interrupt());
+
+ mutex_lock(&bus->mdio_lock);
+ retval = __mdiobus_write_sts(bus, addr, regnum, val, sts);
+ mutex_unlock(&bus->mdio_lock);
+
+ return retval;
+}
+EXPORT_SYMBOL(mdiobus_write_sts);
+
+/**
+ * mdiobus_write_sts_nested - Nested version of the mdiobus_write_sts function
+ * @bus: the mii_bus struct
+ * @addr: the phy address
+ * @regnum: register number to write
+ * @val: value to write to @regnum
+ * @sts: the ptp system timestamp
+ *
+ * In case of nested MDIO bus access avoid lockdep false positives by
+ * using mutex_lock_nested().
+ *
+ * NOTE: MUST NOT be called from interrupt context,
+ * because the bus read/write functions may wait for an interrupt
+ * to conclude the operation.
+ */
+int mdiobus_write_sts_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+ struct ptp_system_timestamp *sts)
+{
+ int retval;
+
+ BUG_ON(in_interrupt());
+
+ mutex_lock_nested(&bus->mdio_lock, MDIO_MUTEX_NESTED);
+ retval = __mdiobus_write_sts(bus, addr, regnum, val, sts);
+ mutex_unlock(&bus->mdio_lock);
+
+ return retval;
+}
+EXPORT_SYMBOL(mdiobus_write_sts_nested);
+
/**
* mdio_bus_match - determine if given MDIO driver supports the given
* MDIO device
diff --git a/include/linux/mdio.h b/include/linux/mdio.h
index e8242ad88c81..d65625c75b15 100644
--- a/include/linux/mdio.h
+++ b/include/linux/mdio.h
@@ -9,6 +9,7 @@
#include <uapi/linux/mdio.h>
#include <linux/mod_devicetable.h>
+struct ptp_system_timestamp;
struct gpio_desc;
struct mii_bus;
@@ -305,11 +306,17 @@ static inline void mii_10gbt_stat_mod_linkmode_lpa_t(unsigned long *advertising,
int __mdiobus_read(struct mii_bus *bus, int addr, u32 regnum);
int __mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val);
+int __mdiobus_write_sts(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+ struct ptp_system_timestamp *sts);
int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum);
int mdiobus_read_nested(struct mii_bus *bus, int addr, u32 regnum);
int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val);
int mdiobus_write_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val);
+int mdiobus_write_sts(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+ struct ptp_system_timestamp *sts);
+int mdiobus_write_sts_nested(struct mii_bus *bus, int addr, u32 regnum, u16 val,
+ struct ptp_system_timestamp *sts);
int mdiobus_register_device(struct mdio_device *mdiodev);
int mdiobus_unregister_device(struct mdio_device *mdiodev);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 462b90b73f93..15afe9c5256b 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -252,6 +252,31 @@ struct mii_bus {
int reset_delay_us;
/* RESET GPIO descriptor pointer */
struct gpio_desc *reset_gpiod;
+
+ /* PTP system timestamping support
+ *
+ * In order to improve the synchronisation precision of phc2sys (from
+ * the linuxptp project) for devices like switches which are attached
+ * to the MDIO bus, it is necessary the get the system timestamps as
+ * close as possible to the access which causes the PTP timestamp
+ * register to be snapshotted in the switch hardware. Usually this is
+ * triggered by an MDIO write access, the snapshotted timestamp is then
+ * transferred by several MDIO reads.
+ *
+ * The switch driver can use mdio_write_sts*() to pass through the
+ * system timestamp pointer @ptp_sts to the MDIO bus driver. The bus
+ * driver simply has to do the following calls in its write handler:
+ * ptp_read_system_prets(bus->ptp_sts);
+ * writel(value, mdio-register)
+ * ptp_read_system_postts(bus->ptp_sts);
+ *
+ * The ptp_read_system_*ts functions already check the ptp_sts pointer.
+ *
+ * @ptp_sts_supported: Must be set to true when the MDIO bus driver
+ * takes the timestamps as described above.
+ */
+ struct ptp_system_timestamp *ptp_sts;
+ bool ptp_sts_supported;
};
#define to_mii_bus(d) container_of(d, struct mii_bus, dev)
--
2.22.1
^ permalink raw reply related
* [PATCH net-next 2/3] net: dsa: mv88e6xxx: extend PTP gettime function to read system clock
From: Hubert Feurstein @ 2019-08-16 16:31 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Vivien Didelot,
Florian Fainelli, Vladimir Oltean, David S. Miller
In-Reply-To: <20190816163157.25314-1-h.feurstein@gmail.com>
This adds support for the PTP_SYS_OFFSET_EXTENDED ioctl.
Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>
---
drivers/net/dsa/mv88e6xxx/chip.h | 2 ++
drivers/net/dsa/mv88e6xxx/ptp.c | 11 +++++++----
drivers/net/dsa/mv88e6xxx/smi.c | 3 ++-
3 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index 01963ee94c50..9e14dc406415 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -277,6 +277,8 @@ struct mv88e6xxx_chip {
struct ptp_clock_info ptp_clock_info;
struct delayed_work tai_event_work;
struct ptp_pin_desc pin_config[MV88E6XXX_MAX_GPIO];
+ struct ptp_system_timestamp *ptp_sts;
+
u16 trig_config;
u16 evcap_config;
u16 enable_count;
diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
index 073cbd0bb91b..cf6e52ee9e0a 100644
--- a/drivers/net/dsa/mv88e6xxx/ptp.c
+++ b/drivers/net/dsa/mv88e6xxx/ptp.c
@@ -235,14 +235,17 @@ static int mv88e6xxx_ptp_adjtime(struct ptp_clock_info *ptp, s64 delta)
return 0;
}
-static int mv88e6xxx_ptp_gettime(struct ptp_clock_info *ptp,
- struct timespec64 *ts)
+static int mv88e6xxx_ptp_gettimex(struct ptp_clock_info *ptp,
+ struct timespec64 *ts,
+ struct ptp_system_timestamp *sts)
{
struct mv88e6xxx_chip *chip = ptp_to_chip(ptp);
u64 ns;
mv88e6xxx_reg_lock(chip);
+ chip->ptp_sts = sts;
ns = timecounter_read(&chip->tstamp_tc);
+ chip->ptp_sts = NULL;
mv88e6xxx_reg_unlock(chip);
*ts = ns_to_timespec64(ns);
@@ -426,7 +429,7 @@ static void mv88e6xxx_ptp_overflow_check(struct work_struct *work)
struct mv88e6xxx_chip *chip = dw_overflow_to_chip(dw);
struct timespec64 ts;
- mv88e6xxx_ptp_gettime(&chip->ptp_clock_info, &ts);
+ mv88e6xxx_ptp_gettimex(&chip->ptp_clock_info, &ts, NULL);
schedule_delayed_work(&chip->overflow_work,
MV88E6XXX_TAI_OVERFLOW_PERIOD);
@@ -472,7 +475,7 @@ int mv88e6xxx_ptp_setup(struct mv88e6xxx_chip *chip)
chip->ptp_clock_info.max_adj = MV88E6XXX_MAX_ADJ_PPB;
chip->ptp_clock_info.adjfine = mv88e6xxx_ptp_adjfine;
chip->ptp_clock_info.adjtime = mv88e6xxx_ptp_adjtime;
- chip->ptp_clock_info.gettime64 = mv88e6xxx_ptp_gettime;
+ chip->ptp_clock_info.gettimex64 = mv88e6xxx_ptp_gettimex;
chip->ptp_clock_info.settime64 = mv88e6xxx_ptp_settime;
chip->ptp_clock_info.enable = ptp_ops->ptp_enable;
chip->ptp_clock_info.verify = ptp_ops->ptp_verify;
diff --git a/drivers/net/dsa/mv88e6xxx/smi.c b/drivers/net/dsa/mv88e6xxx/smi.c
index 5fc78a063843..e3b0096a9d94 100644
--- a/drivers/net/dsa/mv88e6xxx/smi.c
+++ b/drivers/net/dsa/mv88e6xxx/smi.c
@@ -45,7 +45,8 @@ static int mv88e6xxx_smi_direct_write(struct mv88e6xxx_chip *chip,
{
int ret;
- ret = mdiobus_write_nested(chip->bus, dev, reg, data);
+ ret = mdiobus_write_sts_nested(chip->bus, dev, reg, data,
+ chip->ptp_sts);
if (ret < 0)
return ret;
--
2.22.1
^ permalink raw reply related
* [PATCH net-next 0/3] Improve phc2sys precision for mv88e6xxx switch in combination with imx6-fec
From: Hubert Feurstein @ 2019-08-16 16:31 UTC (permalink / raw)
To: netdev, linux-kernel
Cc: Hubert Feurstein, Andrew Lunn, Richard Cochran, Vivien Didelot,
Florian Fainelli, Heiner Kallweit, Vladimir Oltean, Fugang Duan,
David S. Miller
With this patchset the phc2sys synchronisation precision improved to +/-555ns on an IMX6DL with an MV88E6220 switch attached.
This patchset takes into account the comments from the following discussions:
- https://lkml.org/lkml/2019/8/2/1364
- https://lkml.org/lkml/2019/8/5/169
Patch 01 adds the required infrastructure in the MDIO layer.
Patch 02 adds support for the PTP_SYS_OFFSET_EXTENDED ioctl in the mv88e6xxx driver.
Patch 03 adds support for the PTP system timestamping in the imx-fec driver.
The following tests show the improvement caused by each patch. The system clock precision was set to 15ns instead of 333ns (as described in https://lkml.org/lkml/2019/8/2/1364).
Without this patchset applied, the phc2sys synchronisation performance was very poor:
offset: min -27120 max 28840 mean 2.44 stddev 8040.78 count 1236
delay: min 282103 max 386385 mean 352568.03 stddev 27814.27 count 1236
(test runtime 20 minutes)
Results after appling patch 01 and 02:
offset: min -12316 max 13314 mean -9.38 stddev 4274.82 count 1022
delay: min 69977 max 96266 mean 87939.04 stddev 6466.17 count 1022
(test runtime 16 minutes)
Results after appling patch 03:
offset: min -788 max 528 mean -0.06 stddev 185.02 count 7171
delay: min 1773 max 2031 mean 1909.43 stddev 33.74 count 7171
(test runtime 119 minutes)
Hubert Feurstein (3):
net: mdio: add support for passing a PTP system timestamp to the
mii_bus driver
net: dsa: mv88e6xxx: extend PTP gettime function to read system clock
net: fec: add support for PTP system timestamping for MDIO devices
drivers/net/dsa/mv88e6xxx/chip.h | 2 +
drivers/net/dsa/mv88e6xxx/ptp.c | 11 ++-
drivers/net/dsa/mv88e6xxx/smi.c | 3 +-
drivers/net/ethernet/freescale/fec_main.c | 3 +
drivers/net/phy/mdio_bus.c | 105 ++++++++++++++++++++++
include/linux/mdio.h | 7 ++
include/linux/phy.h | 25 ++++++
7 files changed, 151 insertions(+), 5 deletions(-)
--
2.22.1
^ permalink raw reply
* regression in ath10k dma allocation
From: Tobias Klausmann @ 2019-08-16 16:08 UTC (permalink / raw)
To: kvalo, davem, ath10k, linux-wireless, netdev, linux-kernel,
nicoleotsuka, hch, m.szyprowski, robin.murphy, iommu
Cc: tobias.klausmann
Hello all,
within the current development cycle i noted the ath10k driver failing
to setup:
[ 3.185660] ath10k_pci 0000:02:00.0: failed to alloc CE dest ring 1: -12
[ 3.185664] ath10k_pci 0000:02:00.0: failed to allocate copy engine
pipe 1: -12
[ 3.185667] ath10k_pci 0000:02:00.0: failed to allocate copy engine
pipes: -12
[ 3.185669] ath10k_pci 0000:02:00.0: failed to setup resource: -12
[ 3.185692] ath10k_pci: probe of 0000:02:00.0 failed with error -12
the actual failure comes from [1] and indeed bisecting brought me to a
related commit "dma-contiguous: add dma_{alloc,free}_contiguous()
helpers" [2]. Reverting the commit fixes the problem, yet this might
just be the driver abusing the dma infrastructure, so hopefully someone
can have a look at it, as i'm not familiar with the code!
[1]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/wireless/ath/ath10k/ce.c?h=v5.3-rc4#n1650
[2]:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b1d2dc009dece4cd7e629419b52266ba51960a6b
Greetings,
Tobias
^ permalink raw reply
* Re: [PATCH RFC net-next 3/3] net: dsa: mv88e6xxx: setup SERDES irq also for CPU/DSA ports
From: Vivien Didelot @ 2019-08-16 16:25 UTC (permalink / raw)
To: Marek Behún
Cc: netdev, Andrew Lunn, Vladimir Oltean, Florian Fainelli,
Marek Behún
In-Reply-To: <20190816150834.26939-4-marek.behun@nic.cz>
On Fri, 16 Aug 2019 17:08:34 +0200, Marek Behún <marek.behun@nic.cz> wrote:
> @@ -2151,16 +2151,6 @@ static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
> if (err)
> return err;
>
> - /* Enable the SERDES interface for DSA and CPU ports. Normal
> - * ports SERDES are enabled when the port is enabled, thus
> - * saving a bit of power.
> - */
> - if ((dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))) {
> - err = mv88e6xxx_serdes_power(chip, port, true);
> - if (err)
> - return err;
> - }
> -
> /* Port Control 2: don't force a good FCS, set the maximum frame size to
> * 10240 bytes, disable 802.1q tags checking, don't discard tagged or
> * untagged frames on this port, do a destination address lookup on all
> @@ -2557,6 +2547,48 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
> return err;
> }
>
> +static int mv88e6xxx_port_setup(struct dsa_switch *ds, int port)
> +{
> + struct mv88e6xxx_chip *chip = ds->priv;
> + int err;
> +
> + /* Enable the SERDES interface for DSA and CPU ports. Normal
> + * ports SERDES are enabled when the port is enabled, thus
> + * saving a bit of power.
> + */
> + if ((dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))) {
> + mv88e6xxx_reg_lock(chip);
> +
> + err = mv88e6xxx_serdes_power(chip, port, true);
> +
> + if (!err && chip->info->ops->serdes_irq_setup)
> + err = chip->info->ops->serdes_irq_setup(chip, port);
> +
> + mv88e6xxx_reg_unlock(chip);
> +
> + return err;
> + }
> +
> + return 0;
> +}
> +
> +static void mv88e6xxx_port_teardown(struct dsa_switch *ds, int port)
> +{
> + struct mv88e6xxx_chip *chip = ds->priv;
> +
> + if ((dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))) {
> + mv88e6xxx_reg_lock(chip);
> +
> + if (chip->info->ops->serdes_irq_free)
> + chip->info->ops->serdes_irq_free(chip, port);
> +
> + if (mv88e6xxx_serdes_power(chip, port, false))
> + dev_err(chip->dev, "failed to power off SERDES\n");
> +
> + mv88e6xxx_reg_unlock(chip);
> + }
> +}
So now we have mv88e6xxx_setup_port() and mv88e6xxx_port_setup(), which both
setup a port, differently, at different time. This is definitely error prone.
^ permalink raw reply
* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Stanislav Fomichev @ 2019-08-16 16:20 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <20190816155911.GP2820@mini-arch>
On 08/16, Stanislav Fomichev wrote:
> On 08/15, Jakub Kicinski wrote:
> > On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
> > > On 08/15, Toshiaki Makita wrote:
> > > > On 2019/08/15 2:07, Stanislav Fomichev wrote:
> > > > > On 08/13, Toshiaki Makita wrote:
> > > > > > * Implementation
> > > > > >
> > > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> > > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF
> > > > > > program dynamically but a prebuilt program is embedded in UMH. This is
> > > > > > mainly because flow insertion is considerably frequent. If we generate
> > > > > > and load an eBPF program on each insertion of a flow, the latency of the
> > > > > > first packet of ping in above test will incease, which I want to avoid.
> > > > > Can this be instead implemented with a new hook that will be called
> > > > > for TC events? This hook can write to perf event buffer and control
> > > > > plane will insert/remove/modify flow tables in the BPF maps (contol
> > > > > plane will also install xdp program).
> > > > >
> > > > > Why do we need UMH? What am I missing?
> > > >
> > > > So you suggest doing everything in xdp_flow kmod?
> > > You probably don't even need xdp_flow kmod. Add new tc "offload" mode
> > > (bypass) that dumps every command via netlink (or calls the BPF hook
> > > where you can dump it into perf event buffer) and then read that info
> > > from userspace and install xdp programs and modify flow tables.
> > > I don't think you need any kernel changes besides that stream
> > > of data from the kernel about qdisc/tc flow creation/removal/etc.
> >
> > There's a certain allure in bringing the in-kernel BPF translation
> > infrastructure forward. OTOH from system architecture perspective IMHO
> > it does seem like a task best handed in user space. bpfilter can replace
> > iptables completely, here we're looking at an acceleration relatively
> > loosely coupled with flower.
> Even for bpfilter I would've solved it using something similar:
> iptables bypass + redirect iptables netlink requests to some
> userspace helper that was registered to be iptables compatibility
> manager. And then, again, it becomes a purely userspace problem.
Oh, wait, isn't iptables kernel api is setsockopt/getsockopt?
With the new cgroup hooks you can now try to do bpfilter completely
in BPF 🤯
> The issue with UMH is that the helper has to be statically compiled
> from the kernel tree, which means we can't bring in any dependencies
> (stuff like libkefir you mentioned below).
>
> But I digress :-)
>
> > FWIW Quentin spent some time working on a universal flow rule to BPF
> > translation library:
> >
> > https://github.com/Netronome/libkefir
> >
> > A lot remains to be done there, but flower front end is one of the
> > targets. A library can be tuned for any application, without a
> > dependency on flower uAPI.
> >
> > > But, I haven't looked at the series deeply, so I might be missing
> > > something :-)
> >
> > I don't think you are :)
^ permalink raw reply
* RE: [PATCH net-next, 2/6] PCI: hv: Add a Hyper-V PCI mini driver for software backchannel interface
From: Vitaly Kuznetsov @ 2019-08-16 16:16 UTC (permalink / raw)
To: Haiyang Zhang
Cc: KY Srinivasan, Stephen Hemminger, linux-kernel@vger.kernel.org,
sashal@kernel.org, davem@davemloft.net, saeedm@mellanox.com,
leon@kernel.org, eranbe@mellanox.com, lorenzo.pieralisi@arm.com,
bhelgaas@google.com, linux-pci@vger.kernel.org,
linux-hyperv@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <DM6PR21MB13375FA0BA0220A91EF448E1CAAF0@DM6PR21MB1337.namprd21.prod.outlook.com>
Haiyang Zhang <haiyangz@microsoft.com> writes:
>
> The pci_hyperv can only be loaded on VMs on Hyper-V and Azure. Other
> drivers like MLX5e will have symbolic dependency of pci_hyperv if they
> use functions exported by pci_hyperv. This dependency will cause other
> drivers fail to load on other platforms, like VMs on KVM. So we created
> this mini driver, which can be loaded on any platforms to provide the
> symbolic dependency.
(/me wondering is there a nicer way around this, by using __weak or
something like that...)
In case this stub is the best solution I'd suggest to rename it to
something like PCI_HYPERV_INTERFACE to make it clear it is not a
separate driver (_MINI makes me think so).
--
Vitaly
^ permalink raw reply
* Re: [bpf-next] selftests/bpf: fix race in test_tcp_rtt test
From: Stanislav Fomichev @ 2019-08-16 16:13 UTC (permalink / raw)
To: Petar Penkov; +Cc: netdev, bpf, davem, ast, daniel, sdf, Petar Penkov
In-Reply-To: <20190816160339.249832-1-ppenkov.kernel@gmail.com>
On 08/16, Petar Penkov wrote:
> From: Petar Penkov <ppenkov@google.com>
>
> There is a race in this test between receiving the ACK for the
> single-byte packet sent in the test, and reading the values from the
> map.
>
> This patch fixes this by having the client wait until there are no more
> unacknowledged packets.
>
> Before:
> for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \
> done | grep -c PASSED
> < trimmed error messages >
> 993
>
> After:
> for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \
> done | grep -c PASSED
> 10000
>
> Fixes: b55873984dab ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
> Signed-off-by: Petar Penkov <ppenkov@google.com>
> ---
> tools/testing/selftests/bpf/test_tcp_rtt.c | 31 ++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
>
> diff --git a/tools/testing/selftests/bpf/test_tcp_rtt.c b/tools/testing/selftests/bpf/test_tcp_rtt.c
> index 90c3862f74a8..2b4754473956 100644
> --- a/tools/testing/selftests/bpf/test_tcp_rtt.c
> +++ b/tools/testing/selftests/bpf/test_tcp_rtt.c
> @@ -6,6 +6,7 @@
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <netinet/in.h>
> +#include <netinet/tcp.h>
> #include <pthread.h>
>
> #include <linux/filter.h>
> @@ -34,6 +35,30 @@ static void send_byte(int fd)
> error(1, errno, "Failed to send single byte");
> }
>
> +static int wait_for_ack(int fd, int retries)
> +{
> + struct tcp_info info;
> + socklen_t optlen;
> + int i, err;
> +
> + for (i = 0; i < retries; i++) {
> + optlen = sizeof(info);
> + err = getsockopt(fd, SOL_TCP, TCP_INFO, &info, &optlen);
> + if (err < 0) {
> + log_err("Failed to lookup TCP stats");
> + return err;
> + }
> +
> + if (info.tcpi_unacked == 0)
> + return 0;
> +
> + sleep(1);
Isn't it too big of a hammer? Maybe usleep(10) here and do x100 retries
instead?
> + }
> +
> + log_err("Did not receive ACK");
> + return -1;
> +}
> +
> static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 invoked,
> __u32 dsack_dups, __u32 delivered, __u32 delivered_ce,
> __u32 icsk_retransmits)
> @@ -149,6 +174,11 @@ static int run_test(int cgroup_fd, int server_fd)
> /*icsk_retransmits=*/0);
>
> send_byte(client_fd);
> + if (wait_for_ack(client_fd, 5) < 0) {
> + err = -1;
> + goto close_client_fd;
> + }
> +
>
> err += verify_sk(map_fd, client_fd, "first payload byte",
> /*invoked=*/2,
> @@ -157,6 +187,7 @@ static int run_test(int cgroup_fd, int server_fd)
> /*delivered_ce=*/0,
> /*icsk_retransmits=*/0);
>
> +close_client_fd:
> close(client_fd);
>
> close_bpf_object:
> --
> 2.23.0.rc1.153.gdeed80330f-goog
>
^ permalink raw reply
* [bpf-next] selftests/bpf: fix race in test_tcp_rtt test
From: Petar Penkov @ 2019-08-16 16:03 UTC (permalink / raw)
To: netdev, bpf; +Cc: davem, ast, daniel, sdf, Petar Penkov
From: Petar Penkov <ppenkov@google.com>
There is a race in this test between receiving the ACK for the
single-byte packet sent in the test, and reading the values from the
map.
This patch fixes this by having the client wait until there are no more
unacknowledged packets.
Before:
for i in {1..1000}; do ../net/in_netns.sh ./test_tcp_rtt; \
done | grep -c PASSED
< trimmed error messages >
993
After:
for i in {1..10000}; do ../net/in_netns.sh ./test_tcp_rtt; \
done | grep -c PASSED
10000
Fixes: b55873984dab ("selftests/bpf: test BPF_SOCK_OPS_RTT_CB")
Signed-off-by: Petar Penkov <ppenkov@google.com>
---
tools/testing/selftests/bpf/test_tcp_rtt.c | 31 ++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/tools/testing/selftests/bpf/test_tcp_rtt.c b/tools/testing/selftests/bpf/test_tcp_rtt.c
index 90c3862f74a8..2b4754473956 100644
--- a/tools/testing/selftests/bpf/test_tcp_rtt.c
+++ b/tools/testing/selftests/bpf/test_tcp_rtt.c
@@ -6,6 +6,7 @@
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
+#include <netinet/tcp.h>
#include <pthread.h>
#include <linux/filter.h>
@@ -34,6 +35,30 @@ static void send_byte(int fd)
error(1, errno, "Failed to send single byte");
}
+static int wait_for_ack(int fd, int retries)
+{
+ struct tcp_info info;
+ socklen_t optlen;
+ int i, err;
+
+ for (i = 0; i < retries; i++) {
+ optlen = sizeof(info);
+ err = getsockopt(fd, SOL_TCP, TCP_INFO, &info, &optlen);
+ if (err < 0) {
+ log_err("Failed to lookup TCP stats");
+ return err;
+ }
+
+ if (info.tcpi_unacked == 0)
+ return 0;
+
+ sleep(1);
+ }
+
+ log_err("Did not receive ACK");
+ return -1;
+}
+
static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 invoked,
__u32 dsack_dups, __u32 delivered, __u32 delivered_ce,
__u32 icsk_retransmits)
@@ -149,6 +174,11 @@ static int run_test(int cgroup_fd, int server_fd)
/*icsk_retransmits=*/0);
send_byte(client_fd);
+ if (wait_for_ack(client_fd, 5) < 0) {
+ err = -1;
+ goto close_client_fd;
+ }
+
err += verify_sk(map_fd, client_fd, "first payload byte",
/*invoked=*/2,
@@ -157,6 +187,7 @@ static int run_test(int cgroup_fd, int server_fd)
/*delivered_ce=*/0,
/*icsk_retransmits=*/0);
+close_client_fd:
close(client_fd);
close_bpf_object:
--
2.23.0.rc1.153.gdeed80330f-goog
^ permalink raw reply related
* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Stanislav Fomichev @ 2019-08-16 15:59 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <20190815122232.4b1fa01c@cakuba.netronome.com>
On 08/15, Jakub Kicinski wrote:
> On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
> > On 08/15, Toshiaki Makita wrote:
> > > On 2019/08/15 2:07, Stanislav Fomichev wrote:
> > > > On 08/13, Toshiaki Makita wrote:
> > > > > * Implementation
> > > > >
> > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF
> > > > > program dynamically but a prebuilt program is embedded in UMH. This is
> > > > > mainly because flow insertion is considerably frequent. If we generate
> > > > > and load an eBPF program on each insertion of a flow, the latency of the
> > > > > first packet of ping in above test will incease, which I want to avoid.
> > > > Can this be instead implemented with a new hook that will be called
> > > > for TC events? This hook can write to perf event buffer and control
> > > > plane will insert/remove/modify flow tables in the BPF maps (contol
> > > > plane will also install xdp program).
> > > >
> > > > Why do we need UMH? What am I missing?
> > >
> > > So you suggest doing everything in xdp_flow kmod?
> > You probably don't even need xdp_flow kmod. Add new tc "offload" mode
> > (bypass) that dumps every command via netlink (or calls the BPF hook
> > where you can dump it into perf event buffer) and then read that info
> > from userspace and install xdp programs and modify flow tables.
> > I don't think you need any kernel changes besides that stream
> > of data from the kernel about qdisc/tc flow creation/removal/etc.
>
> There's a certain allure in bringing the in-kernel BPF translation
> infrastructure forward. OTOH from system architecture perspective IMHO
> it does seem like a task best handed in user space. bpfilter can replace
> iptables completely, here we're looking at an acceleration relatively
> loosely coupled with flower.
Even for bpfilter I would've solved it using something similar:
iptables bypass + redirect iptables netlink requests to some
userspace helper that was registered to be iptables compatibility
manager. And then, again, it becomes a purely userspace problem.
The issue with UMH is that the helper has to be statically compiled
from the kernel tree, which means we can't bring in any dependencies
(stuff like libkefir you mentioned below).
But I digress :-)
> FWIW Quentin spent some time working on a universal flow rule to BPF
> translation library:
>
> https://github.com/Netronome/libkefir
>
> A lot remains to be done there, but flower front end is one of the
> targets. A library can be tuned for any application, without a
> dependency on flower uAPI.
>
> > But, I haven't looked at the series deeply, so I might be missing
> > something :-)
>
> I don't think you are :)
^ permalink raw reply
* Re: [PATCH bpf-next] libbpf: remove zc variable as it is not used
From: Yonghong Song @ 2019-08-16 15:37 UTC (permalink / raw)
To: Magnus Karlsson, bjorn.topel@intel.com, ast@kernel.org,
daniel@iogearbox.net, netdev@vger.kernel.org
Cc: bpf@vger.kernel.org
In-Reply-To: <1565951171-14439-1-git-send-email-magnus.karlsson@intel.com>
On 8/16/19 3:26 AM, Magnus Karlsson wrote:
> The zc is not used in the xsk part of libbpf, so let us remove it. Not
> good to have dead code lying around.
>
> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
> Reported-by: Yonghong Song <yhs@fb.com> > ---
> tools/lib/bpf/xsk.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> index 680e630..9687da9 100644
> --- a/tools/lib/bpf/xsk.c
> +++ b/tools/lib/bpf/xsk.c
> @@ -65,7 +65,6 @@ struct xsk_socket {
> int xsks_map_fd;
> __u32 queue_id;
> char ifname[IFNAMSIZ];
> - bool zc;
> };
>
> struct xsk_nl_info {
> @@ -608,8 +607,6 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
> goto out_mmap_tx;
> }
>
> - xsk->zc = opts.flags & XDP_OPTIONS_ZEROCOPY;
Since opts.flags usage is removed. Do you think it makes sense to
remove
optlen = sizeof(opts);
err = getsockopt(xsk->fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen);
if (err) {
err = -errno;
goto out_mmap_tx;
}
as well since nobody then uses opts?
> -
> if (!(xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)) {
> err = xsk_setup_xdp_prog(xsk);
> if (err)
>
^ permalink raw reply
* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
From: Stanislav Fomichev @ 2019-08-16 15:35 UTC (permalink / raw)
To: Toshiaki Makita
Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
Yonghong Song, David S. Miller, Jakub Kicinski,
Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
Cong Wang, Jiri Pirko, netdev, bpf, William Tu
In-Reply-To: <4614fefc-fc43-8cf7-d064-7dc1947acc6c@gmail.com>
On 08/16, Toshiaki Makita wrote:
> On 2019/08/16 0:21, Stanislav Fomichev wrote:
> > On 08/15, Toshiaki Makita wrote:
> > > On 2019/08/15 2:07, Stanislav Fomichev wrote:
> > > > On 08/13, Toshiaki Makita wrote:
> > > > > * Implementation
> > > > >
> > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF
> > > > > program dynamically but a prebuilt program is embedded in UMH. This is
> > > > > mainly because flow insertion is considerably frequent. If we generate
> > > > > and load an eBPF program on each insertion of a flow, the latency of the
> > > > > first packet of ping in above test will incease, which I want to avoid.
> > > > Can this be instead implemented with a new hook that will be called
> > > > for TC events? This hook can write to perf event buffer and control
> > > > plane will insert/remove/modify flow tables in the BPF maps (contol
> > > > plane will also install xdp program).
> > > >
> > > > Why do we need UMH? What am I missing?
> > >
> > > So you suggest doing everything in xdp_flow kmod?
> > You probably don't even need xdp_flow kmod. Add new tc "offload" mode
> > (bypass) that dumps every command via netlink (or calls the BPF hook
> > where you can dump it into perf event buffer) and then read that info
> > from userspace and install xdp programs and modify flow tables.
> > I don't think you need any kernel changes besides that stream
> > of data from the kernel about qdisc/tc flow creation/removal/etc.
>
> My intention is to make more people who want high speed network easily use XDP,
> so making transparent XDP offload with current TC interface.
>
> What userspace program would monitor TC events with your suggestion?
Have a new system daemon (xdpflowerd) that is independently
packaged/shipped/installed. Anybody who wants accelerated TC can
download/install it. OVS can be completely unaware of this.
> ovs-vswitchd? If so, it even does not need to monitor TC. It can
> implement XDP offload directly.
> (However I prefer kernel solution. Please refer to "About alternative
> userland (ovs-vswitchd etc.) implementation" section in the cover letter.)
>
> Also such a TC monitoring solution easily can be out-of-sync with real TC
> behavior as TC filter/flower is being heavily developed and changed,
> e.g. introduction of TC block, support multiple masks with the same pref, etc.
> I'm not sure such an unreliable solution have much value.
This same issue applies to the in-kernel implementation, isn't it?
What happens if somebody sends patches for a new flower feature but
doesn't add appropriate xdp support? Do we reject them?
That's why I'm suggesting to move this problem to the userspace :-)
> > But, I haven't looked at the series deeply, so I might be missing
> > something :-)
> >
> > > I also thought about that. There are two phases so let's think about them separately.
> > >
> > > 1) TC block (qdisc) creation / eBPF load
> > >
> > > I saw eBPF maintainers repeatedly saying eBPF program loading needs to be
> > > done from userland, not from kernel, to run the verifier for safety.
> > > However xdp_flow eBPF program is prebuilt and embedded in kernel so we may
> > > allow such programs to be loaded from kernel? I currently don't have the will
> > > to make such an API as loading can be done with current UMH mechanism.
> > >
> > > 2) flow insertion / eBPF map update
> > >
> > > Not sure if this needs to be done from userland. One concern is that eBPF maps can
> > > be modified by unrelated processes and we need to handle all unexpected state of maps.
> > > Such handling tends to be difficult and may cause unexpected kernel behavior.
> > > OTOH updating maps from kmod may reduces the latency of flow insertion drastically.
> > Latency from the moment I type 'tc filter add ...' to the moment the rule
> > is installed into the maps? Does it really matter?
>
> Yes it matters. Flow insertion is kind of data path in OVS.
> Please see how ping latency is affected in the cover letter.
Ok, but what I'm suggesting shouldn't be less performant.
We are talking about UMH writing into a pipe vs writing TC events into
a netlink.
> > Do I understand correctly that both of those events (qdisc creation and
> > flow insertion) are triggered from tcf_block_offload_cmd (or similar)?
>
> Both of eBPF load and map update are triggered from tcf_block_offload_cmd.
> I think you understand it correctly.
>
> Toshiaki Makita
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox