Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 1/1] netfilter: nf_tables: fib: Drop IPV6 packages if IPv6 is disabled on boot
From: Leonardo Bras @ 2019-08-26 16:47 UTC (permalink / raw)
  To: Florian Westphal, Pablo Neira Ayuso
  Cc: netfilter-devel, coreteam, netdev, linux-kernel, Jozsef Kadlecsik,
	David S. Miller
In-Reply-To: <20190821095844.me6kscvnfruinseu@salvia>

[-- Attachment #1: Type: text/plain, Size: 1169 bytes --]

Hello Pablo, Florian,

I implemented a V2 of this patch with the changes you proposed.
Could you please give your feedback on that patch?
https://lkml.org/lkml/2019/8/21/527

Thanks!

On Wed, 2019-08-21 at 11:58 +0200, Pablo Neira Ayuso wrote:
> On Tue, Aug 20, 2019 at 01:15:58PM -0300, Leonardo Bras wrote:
> > On Tue, 2019-08-20 at 07:36 +0200, Florian Westphal wrote:
> > > Wouldn't fib_netdev.c have the same problem?
> > Probably, but I haven't hit this issue yet.
> > 
> > > If so, might be better to place this test in both
> > > nft_fib6_eval_type and nft_fib6_eval.
> > 
> > I think that is possible, and not very hard to do.
> > 
> > But in my humble viewpoint, it looks like it's nft_fib_inet_eval() and
> > nft_fib_netdev_eval() have the responsibility to choose a valid
> > protocol or drop the package. 
> > I am not sure if it would be a good move to transfer this
> > responsibility to nft_fib6_eval_type() and nft_fib6_eval(), so I would
> > rather add the same test to nft_fib_netdev_eval().
> > 
> > Does it make sense?
> 
> Please, update common code to netdev and ip6 extensions as Florian
> suggests.
> 
> Thanks.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: Unable to create htb tc classes more than 64K
From: Jesper Dangaard Brouer @ 2019-08-26 16:45 UTC (permalink / raw)
  To: Akshat Kakkar
  Cc: brouer, Cong Wang, NetFilter, lartc, netdev, Eric Dumazet,
	Toke Høiland-Jørgensen, Anton Danilov
In-Reply-To: <CAA5aLPjzX+9YFRGgCgceHjkU0=e6x8YMENfp_cC9fjfHYK3e+A@mail.gmail.com>

On Sun, 18 Aug 2019 00:34:33 +0530
Akshat Kakkar <akshat.1984@gmail.com> wrote:

> My goal is not just to make as many classes as possible, but also to
> use them to do rate limiting per ip per server. Say, I have a list of
> 10000 IPs and more than 100 servers. So simply if I want few IPs to
> get speed of says 1Mbps per server but others say speed of 2 Mbps per
> server. How can I achieve this without having 10000 x 100 classes.
> These numbers can be large than this and hence I am looking for a
> generic solution to this.

As Eric Dumazet also points out indirectly, you will be creating a huge
bottleneck for SMP/multi-core CPUs.  As your HTB root qdisc is a
serialization point for all egress traffic, that all CPUs will need to
take a lock on.

It sounds like your use-case is not global rate limiting, but instead
the goal is to rate limit customers or services (to something
significantly lower than NIC link speed).  To get scalability, in this
case, you can instead use the MQ qdisc (as Eric also points out).
I have an example script here[1], that shows how to setup MQ as root
qdisc and add HTB leafs based on how many TX-queue the interface have
via /sys/class/net/$DEV/queues/tx-*/

[1] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/bin/tc_mq_htb_setup_example.sh

You are not done, yet.  For solving the TX-queue locking congestion, the
traffic needs to be redirected to the appropriate/correct TX CPUs. This
can either be done with RSS (Receive Side Scaling) HW ethtool
adjustment (reduce hash to IPs L3 only), or RPS (Receive Packet
Steering), or with XDP cpumap redirect.

The XDP cpumap redirect feature is implemented with XDP+TC BPF code
here[2]. Notice, that XPS can screw with this so there is a XPS disable
script here[3].

[2] https://github.com/xdp-project/xdp-cpumap-tc
[3] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/bin/xps_setup.sh

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH bpf] nfp: bpf: fix latency bug when updating stack index register
From: Jakub Kicinski @ 2019-08-26 16:41 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Song Liu, bpf, Networking, OSS Drivers,
	Jiong Wang
In-Reply-To: <1417962c-e63d-6c46-bf07-9284f5332583@iogearbox.net>

On Mon, 26 Aug 2019 18:25:10 +0200, Daniel Borkmann wrote:
> On 8/26/19 6:18 PM, Alexei Starovoitov wrote:
> > On Mon, Aug 26, 2019 at 8:57 AM Jakub Kicinski
> > <jakub.kicinski@netronome.com> wrote:  
> >> On Sun, Aug 25, 2019 at 10:37 PM Song Liu <liu.song.a23@gmail.com> wrote:  
> >>> On Fri, Aug 23, 2019 at 7:04 PM Jakub Kicinski wrote:  
> >>>> From: Jiong Wang <jiong.wang@netronome.com>
> >>>>
> >>>> NFP is using Local Memory to model stack. LM_addr could be used as base of
> >>>> a 16 32-bit word region of Local Memory. Then, if the stack offset is
> >>>> beyond the current region, the local index needs to be updated. The update
> >>>> needs at least three cycles to take effect, therefore the sequence normally
> >>>> looks like:
> >>>>
> >>>>    local_csr_wr[ActLMAddr3, gprB_5]
> >>>>    nop
> >>>>    nop
> >>>>    nop
> >>>>
> >>>> If the local index switch happens on a narrow loads, then the instruction
> >>>> preparing value to zero high 32-bit of the destination register could be
> >>>> counted as one cycle, the sequence then could be something like:
> >>>>
> >>>>    local_csr_wr[ActLMAddr3, gprB_5]
> >>>>    nop
> >>>>    nop
> >>>>    immed[gprB_5, 0]
> >>>>
> >>>> However, we have zero extension optimization that zeroing high 32-bit could
> >>>> be eliminated, therefore above IMMED insn won't be available for which case
> >>>> the first sequence needs to be generated.
> >>>>
> >>>> Fixes: 0b4de1ff19bf ("nfp: bpf: eliminate zero extension code-gen")
> >>>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> >>>> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>  
> >>> I haven't looked into the code yet. But ^^^ should be
> >>>
> >>> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> >>>
> >>> right?  
> >>
> >> I prefer Review on code I review, ack on code I ack, and sign-off on
> >> code I co-author.  
> > 
> > I believe if you're sending somebody else patch you have to add your SOB
> > in addition to their 'Author:' and their SOB fields.  
> 
> +1, for co-authoring there's a 'Co-authored-by:' tag which seems to be frequently
> used these days.

Ack, there is a difference between co-author of code, and co-author as
step by step guidance. I've been doing this for 6 years now, and nobody
ever complained :)

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Is that enough or should I repost?

^ permalink raw reply

* Re: [PATCH v2] riscv: add support for SECCOMP and SECCOMP_FILTER
From: David Abdurachmanov @ 2019-08-26 16:39 UTC (permalink / raw)
  To: Tycho Andersen
  Cc: Paul Walmsley, Palmer Dabbelt, Albert Ou, Oleg Nesterov,
	Kees Cook, Andy Lutomirski, Will Drewry, Shuah Khan,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David Abdurachmanov, Thomas Gleixner,
	Allison Randal, Alexios Zavras, Anup Patel, Vincent Chen,
	Alan Kao, linux-riscv, linux-kernel, linux-kselftest, netdev, bpf,
	me
In-Reply-To: <20190826145756.GB4664@cisco>

On Mon, Aug 26, 2019 at 7:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> Hi,
>
> On Fri, Aug 23, 2019 at 05:30:53PM -0700, Paul Walmsley wrote:
> > On Thu, 22 Aug 2019, David Abdurachmanov wrote:
> >
> > > There is one failing kernel selftest: global.user_notification_signal
> >
> > Also - could you follow up with the author of this failing test to see if
> > we can get some more clarity about what might be going wrong here?  It
> > appears that the failing test was added in commit 6a21cc50f0c7f ("seccomp:
> > add a return code to trap to userspace") by Tycho Andersen
> > <tycho@tycho.ws>.
>
> Can you post an strace and a cat of /proc/$pid/stack for both tasks
> where it gets stuck? I don't have any riscv hardware, and it "works
> for me" on x86 and arm64 with 100 tries.

I don't have the a build with SECCOMP for the board right now, so it
will have to wait. I just finished a new kernel (almost rc6) for Fedora,
but it will take time to assemble new repositories and a disk image.

There is older disk image available (5.2.0-rc7 kernel with v2 SECCOMP)
for QEMU or libvirt/QEMU:

https://dl.fedoraproject.org/pub/alt/risc-v/disk-images/fedora/rawhide/20190703.n.0/Developer/
https://fedoraproject.org/wiki/Architectures/RISC-V/Installing#Boot_with_libvirt

(If you are interesting trying it locally.)

IIRC I attempted to connected with strace, but it quickly returns and fails
properly. Simply put strace unblocks whatever is stuck.

david

^ permalink raw reply

* Re: [PATCH bpf-next v2 2/4] xsk: add proper barriers and {READ, WRITE}_ONCE-correctness for state
From: Björn Töpel @ 2019-08-26 16:34 UTC (permalink / raw)
  To: Ilya Maximets, Björn Töpel, ast, daniel, netdev
  Cc: magnus.karlsson, magnus.karlsson, bpf, jonathan.lemon,
	syzbot+c82697e3043781e08802, hdanton
In-Reply-To: <14576fd3-69ce-6493-5a38-c47566851d4e@samsung.com>

On 2019-08-26 17:24, Ilya Maximets wrote:
> This changes the error code a bit.
> Previously:
>     umem exists + xs unbound    --> EINVAL
>     no umem     + xs unbound    --> EBADF
>     xs bound to different dev/q --> EINVAL
> 
> With this change:
>     umem exists + xs unbound    --> EBADF
>     no umem     + xs unbound    --> EBADF
>     xs bound to different dev/q --> EINVAL
> 
> Just a note. Not sure if this is important.
> 

Note that this is for *shared* umem, so it's very seldom used. Still,
you're right, that strictly this is an uapi break, but I'd vote for the
change still. I find it hard to see that anyone relies on EINVAL/EBADF
for shared umem bind.

Opinions? :-)


Björn

^ permalink raw reply

* Re: [PATCH] samples: bpf: add max_pckt_size option at xdp_adjust_tail
From: Daniel T. Lee @ 2019-08-26 16:26 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190826175420.000021f3@gmail.com>

On Tue, Aug 27, 2019 at 12:54 AM Maciej Fijalkowski
<maciejromanfijalkowski@gmail.com> wrote:
>
> On Mon, 26 Aug 2019 18:57:22 +0900
> "Daniel T. Lee" <danieltimlee@gmail.com> wrote:
>
> > Currently, at xdp_adjust_tail_kern.c, MAX_PCKT_SIZE is limited
> > to 600. To make this size flexible, a new map 'pcktsz' is added.
> >
> > By updating new packet size to this map from the userland,
> > xdp_adjust_tail_kern.o will use this value as a new max_pckt_size.
> >
> > If no '-P <MAX_PCKT_SIZE>' option is used, the size of maximum packet
> > will be 600 as a default.
> >
> > Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> > ---
> >  samples/bpf/xdp_adjust_tail_kern.c | 23 +++++++++++++++++++----
> >  samples/bpf/xdp_adjust_tail_user.c | 21 +++++++++++++++++++--
> >  2 files changed, 38 insertions(+), 6 deletions(-)
> >
> > diff --git a/samples/bpf/xdp_adjust_tail_kern.c b/samples/bpf/xdp_adjust_tail_kern.c
> > index 411fdb21f8bc..4d53af370b68 100644
> > --- a/samples/bpf/xdp_adjust_tail_kern.c
> > +++ b/samples/bpf/xdp_adjust_tail_kern.c
> > @@ -25,6 +25,13 @@
> >  #define ICMP_TOOBIG_SIZE 98
> >  #define ICMP_TOOBIG_PAYLOAD_SIZE 92
> >
> > +struct bpf_map_def SEC("maps") pcktsz = {
> > +     .type = BPF_MAP_TYPE_ARRAY,
> > +     .key_size = sizeof(__u32),
> > +     .value_size = sizeof(__u32),
> > +     .max_entries = 1,
> > +};
> > +
> >  struct bpf_map_def SEC("maps") icmpcnt = {
> >       .type = BPF_MAP_TYPE_ARRAY,
> >       .key_size = sizeof(__u32),
> > @@ -64,7 +71,8 @@ static __always_inline void ipv4_csum(void *data_start, int data_size,
> >       *csum = csum_fold_helper(*csum);
> >  }
> >
> > -static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
> > +static __always_inline int send_icmp4_too_big(struct xdp_md *xdp,
> > +                                           __u32 max_pckt_size)
> >  {
> >       int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr);
> >
> > @@ -92,7 +100,7 @@ static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
> >       orig_iph = data + off;
> >       icmp_hdr->type = ICMP_DEST_UNREACH;
> >       icmp_hdr->code = ICMP_FRAG_NEEDED;
> > -     icmp_hdr->un.frag.mtu = htons(MAX_PCKT_SIZE-sizeof(struct ethhdr));
> > +     icmp_hdr->un.frag.mtu = htons(max_pckt_size - sizeof(struct ethhdr));
> >       icmp_hdr->checksum = 0;
> >       ipv4_csum(icmp_hdr, ICMP_TOOBIG_PAYLOAD_SIZE, &csum);
> >       icmp_hdr->checksum = csum;
> > @@ -118,14 +126,21 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
> >  {
> >       void *data_end = (void *)(long)xdp->data_end;
> >       void *data = (void *)(long)xdp->data;
> > +     __u32 max_pckt_size = MAX_PCKT_SIZE;
> > +     __u32 *pckt_sz;
> > +     __u32 key = 0;
> >       int pckt_size = data_end - data;
> >       int offset;
> >
> > -     if (pckt_size > MAX_PCKT_SIZE) {
> > +     pckt_sz = bpf_map_lookup_elem(&pcktsz, &key);
> > +     if (pckt_sz && *pckt_sz)
> > +             max_pckt_size = *pckt_sz;
> > +
> > +     if (pckt_size > max_pckt_size) {
> >               offset = pckt_size - ICMP_TOOBIG_SIZE;
> >               if (bpf_xdp_adjust_tail(xdp, 0 - offset))
> >                       return XDP_PASS;
> > -             return send_icmp4_too_big(xdp);
> > +             return send_icmp4_too_big(xdp, max_pckt_size);
> >       }
> >       return XDP_PASS;
> >  }
> > diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
> > index a3596b617c4c..dd3befa5e1fe 100644
> > --- a/samples/bpf/xdp_adjust_tail_user.c
> > +++ b/samples/bpf/xdp_adjust_tail_user.c
> > @@ -72,6 +72,7 @@ static void usage(const char *cmd)
> >       printf("Usage: %s [...]\n", cmd);
> >       printf("    -i <ifname|ifindex> Interface\n");
> >       printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
> > +     printf("    -P <MAX_PCKT_SIZE> Default: 600\n");
> >       printf("    -S use skb-mode\n");
> >       printf("    -N enforce native mode\n");
> >       printf("    -F force loading prog\n");
> > @@ -85,9 +86,11 @@ int main(int argc, char **argv)
> >               .prog_type      = BPF_PROG_TYPE_XDP,
> >       };
> >       unsigned char opt_flags[256] = {};
> > -     const char *optstr = "i:T:SNFh";
> > +     const char *optstr = "i:T:P:SNFh";
> >       struct bpf_prog_info info = {};
> >       __u32 info_len = sizeof(info);
> > +     __u32 max_pckt_size = 0;
> > +     __u32 key = 0;
> >       unsigned int kill_after_s = 0;
> >       int i, prog_fd, map_fd, opt;
> >       struct bpf_object *obj;
> > @@ -110,6 +113,9 @@ int main(int argc, char **argv)
> >               case 'T':
> >                       kill_after_s = atoi(optarg);
> >                       break;
> > +             case 'P':
> > +                     max_pckt_size = atoi(optarg);
> > +                     break;
> >               case 'S':
> >                       xdp_flags |= XDP_FLAGS_SKB_MODE;
> >                       break;
> > @@ -150,9 +156,20 @@ int main(int argc, char **argv)
> >       if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
> >               return 1;
> >
> > +     /* update pcktsz map */
> >       map = bpf_map__next(NULL, obj);
> >       if (!map) {
> > -             printf("finding a map in obj file failed\n");
> > +             printf("finding a pcktsz map in obj file failed\n");
> > +             return 1;
> > +     }
> > +     map_fd = bpf_map__fd(map);
>
> Consider using bpf_object__find_map_fd_by_name() here.
>
> > +     if (max_pckt_size)
> > +             bpf_map_update_elem(map_fd, &key, &max_pckt_size, BPF_ANY);
> > +
> > +     /* fetch icmpcnt map */
> > +     map = bpf_map__next(map, obj);
> > +     if (!map) {
> > +             printf("finding a icmpcnt map in obj file failed\n");
> >               return 1;
> >       }
> >       map_fd = bpf_map__fd(map);
>

Thanks for the review!
I'll update it right away.

^ permalink raw reply

* [bpf-next, v2] samples: bpf: add max_pckt_size option at xdp_adjust_tail
From: Daniel T. Lee @ 2019-08-26 16:25 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev

Currently, at xdp_adjust_tail_kern.c, MAX_PCKT_SIZE is limited
to 600. To make this size flexible, a new map 'pcktsz' is added.

By updating new packet size to this map from the userland,
xdp_adjust_tail_kern.o will use this value as a new max_pckt_size.

If no '-P <MAX_PCKT_SIZE>' option is used, the size of maximum packet
will be 600 as a default.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>

---
Changes in v2:
    - Change the helper to fetch map from 'bpf_map__next' to 
    'bpf_object__find_map_fd_by_name'.

 samples/bpf/xdp_adjust_tail_kern.c | 23 +++++++++++++++++++----
 samples/bpf/xdp_adjust_tail_user.c | 27 +++++++++++++++++++++------
 2 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/samples/bpf/xdp_adjust_tail_kern.c b/samples/bpf/xdp_adjust_tail_kern.c
index 411fdb21f8bc..d6d84ffe6a7a 100644
--- a/samples/bpf/xdp_adjust_tail_kern.c
+++ b/samples/bpf/xdp_adjust_tail_kern.c
@@ -25,6 +25,13 @@
 #define ICMP_TOOBIG_SIZE 98
 #define ICMP_TOOBIG_PAYLOAD_SIZE 92
 
+struct bpf_map_def SEC("maps") pcktsz = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(__u32),
+	.max_entries = 1,
+};
+
 struct bpf_map_def SEC("maps") icmpcnt = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(__u32),
@@ -64,7 +71,8 @@ static __always_inline void ipv4_csum(void *data_start, int data_size,
 	*csum = csum_fold_helper(*csum);
 }
 
-static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
+static __always_inline int send_icmp4_too_big(struct xdp_md *xdp,
+					      __u32 max_pckt_size)
 {
 	int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr);
 
@@ -92,7 +100,7 @@ static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
 	orig_iph = data + off;
 	icmp_hdr->type = ICMP_DEST_UNREACH;
 	icmp_hdr->code = ICMP_FRAG_NEEDED;
-	icmp_hdr->un.frag.mtu = htons(MAX_PCKT_SIZE-sizeof(struct ethhdr));
+	icmp_hdr->un.frag.mtu = htons(max_pckt_size - sizeof(struct ethhdr));
 	icmp_hdr->checksum = 0;
 	ipv4_csum(icmp_hdr, ICMP_TOOBIG_PAYLOAD_SIZE, &csum);
 	icmp_hdr->checksum = csum;
@@ -118,14 +126,21 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
 {
 	void *data_end = (void *)(long)xdp->data_end;
 	void *data = (void *)(long)xdp->data;
+	__u32 max_pckt_size = MAX_PCKT_SIZE;
+	__u32 *pckt_sz;
+	__u32 key = 0;
 	int pckt_size = data_end - data;
 	int offset;
 
-	if (pckt_size > MAX_PCKT_SIZE) {
+	pckt_sz = bpf_map_lookup_elem(&pcktsz, &key);
+	if (pckt_sz && *pckt_sz)
+		max_pckt_size = *pckt_sz;
+
+	if (pckt_size > max_pckt_size) {
 		offset = pckt_size - ICMP_TOOBIG_SIZE;
 		if (bpf_xdp_adjust_tail(xdp, 0 - offset))
 			return XDP_PASS;
-		return send_icmp4_too_big(xdp);
+		return send_icmp4_too_big(xdp, max_pckt_size);
 	}
 	return XDP_PASS;
 }
diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
index a3596b617c4c..29ade7caf841 100644
--- a/samples/bpf/xdp_adjust_tail_user.c
+++ b/samples/bpf/xdp_adjust_tail_user.c
@@ -72,6 +72,7 @@ static void usage(const char *cmd)
 	printf("Usage: %s [...]\n", cmd);
 	printf("    -i <ifname|ifindex> Interface\n");
 	printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
+	printf("    -P <MAX_PCKT_SIZE> Default: 600\n");
 	printf("    -S use skb-mode\n");
 	printf("    -N enforce native mode\n");
 	printf("    -F force loading prog\n");
@@ -85,13 +86,14 @@ int main(int argc, char **argv)
 		.prog_type	= BPF_PROG_TYPE_XDP,
 	};
 	unsigned char opt_flags[256] = {};
-	const char *optstr = "i:T:SNFh";
+	const char *optstr = "i:T:P:SNFh";
 	struct bpf_prog_info info = {};
 	__u32 info_len = sizeof(info);
+	__u32 max_pckt_size = 0;
+	__u32 key = 0;
 	unsigned int kill_after_s = 0;
 	int i, prog_fd, map_fd, opt;
 	struct bpf_object *obj;
-	struct bpf_map *map;
 	char filename[256];
 	int err;
 
@@ -110,6 +112,9 @@ int main(int argc, char **argv)
 		case 'T':
 			kill_after_s = atoi(optarg);
 			break;
+		case 'P':
+			max_pckt_size = atoi(optarg);
+			break;
 		case 'S':
 			xdp_flags |= XDP_FLAGS_SKB_MODE;
 			break;
@@ -150,12 +155,22 @@ int main(int argc, char **argv)
 	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
 		return 1;
 
-	map = bpf_map__next(NULL, obj);
-	if (!map) {
-		printf("finding a map in obj file failed\n");
+	/* update pcktsz map */
+	if (max_pckt_size) {
+		map_fd = bpf_object__find_map_fd_by_name(obj, "pcktsz");
+		if (!map_fd) {
+			printf("finding a pcktsz map in obj file failed\n");
+			return 1;
+		}
+		bpf_map_update_elem(map_fd, &key, &max_pckt_size, BPF_ANY);
+	}
+
+	/* fetch icmpcnt map */
+	map_fd = bpf_object__find_map_fd_by_name(obj, "icmpcnt");
+	if (!map_fd) {
+		printf("finding a icmpcnt map in obj file failed\n");
 		return 1;
 	}
-	map_fd = bpf_map__fd(map);
 
 	if (!prog_fd) {
 		printf("load_bpf_file: %s\n", strerror(errno));
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH bpf] nfp: bpf: fix latency bug when updating stack index register
From: Daniel Borkmann @ 2019-08-26 16:25 UTC (permalink / raw)
  To: Alexei Starovoitov, Jakub Kicinski
  Cc: Song Liu, bpf, Networking, OSS Drivers, Jiong Wang
In-Reply-To: <CAADnVQ++TEUK=Cb3sCyunFyYFcpXu=NK71P4-1rEWEGCGewU7A@mail.gmail.com>

On 8/26/19 6:18 PM, Alexei Starovoitov wrote:
> On Mon, Aug 26, 2019 at 8:57 AM Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
>> On Sun, Aug 25, 2019 at 10:37 PM Song Liu <liu.song.a23@gmail.com> wrote:
>>> On Fri, Aug 23, 2019 at 7:04 PM Jakub Kicinski wrote:
>>>> From: Jiong Wang <jiong.wang@netronome.com>
>>>>
>>>> NFP is using Local Memory to model stack. LM_addr could be used as base of
>>>> a 16 32-bit word region of Local Memory. Then, if the stack offset is
>>>> beyond the current region, the local index needs to be updated. The update
>>>> needs at least three cycles to take effect, therefore the sequence normally
>>>> looks like:
>>>>
>>>>    local_csr_wr[ActLMAddr3, gprB_5]
>>>>    nop
>>>>    nop
>>>>    nop
>>>>
>>>> If the local index switch happens on a narrow loads, then the instruction
>>>> preparing value to zero high 32-bit of the destination register could be
>>>> counted as one cycle, the sequence then could be something like:
>>>>
>>>>    local_csr_wr[ActLMAddr3, gprB_5]
>>>>    nop
>>>>    nop
>>>>    immed[gprB_5, 0]
>>>>
>>>> However, we have zero extension optimization that zeroing high 32-bit could
>>>> be eliminated, therefore above IMMED insn won't be available for which case
>>>> the first sequence needs to be generated.
>>>>
>>>> Fixes: 0b4de1ff19bf ("nfp: bpf: eliminate zero extension code-gen")
>>>> Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
>>>> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>>> I haven't looked into the code yet. But ^^^ should be
>>>
>>> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>>>
>>> right?
>>
>> I prefer Review on code I review, ack on code I ack, and sign-off on
>> code I co-author.
> 
> I believe if you're sending somebody else patch you have to add your SOB
> in addition to their 'Author:' and their SOB fields.

+1, for co-authoring there's a 'Co-authored-by:' tag which seems to be frequently
used these days.

^ permalink raw reply

* [PATCH net] tcp: remove empty skb from write queue in error cases
From: Eric Dumazet @ 2019-08-26 16:19 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Eric Dumazet, Soheil Hassas Yeganeh, Neal Cardwell,
	Eric Dumazet, Jason Baron, Vladimir Rutsky

Vladimir Rutsky reported stuck TCP sessions after memory pressure
events. Edge Trigger epoll() user would never receive an EPOLLOUT
notification allowing them to retry a sendmsg().

Jason tested the case of sk_stream_alloc_skb() returning NULL,
but there are other paths that could lead both sendmsg() and sendpage()
to return -1 (EAGAIN), with an empty skb queued on the write queue.

This patch makes sure we remove this empty skb so that
Jason code can detect that the queue is empty, and
call sk->sk_write_space(sk) accordingly.

Fixes: ce5ec440994b ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Reported-by: Vladimir Rutsky <rutsky@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
---
 net/ipv4/tcp.c | 30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 77b485d60b9d0e00edc4e2f0d6c5bb3a9460b23b..61082065b26a068975c411b74eb46739ab0632ca 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -935,6 +935,22 @@ static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
 	return mss_now;
 }
 
+/* In some cases, both sendpage() and sendmsg() could have added
+ * an skb to the write queue, but failed adding payload on it.
+ * We need to remove it to consume less memory, but more
+ * importantly be able to generate EPOLLOUT for Edge Trigger epoll()
+ * users.
+ */
+static void tcp_remove_empty_skb(struct sock *sk, struct sk_buff *skb)
+{
+	if (skb && !skb->len) {
+		tcp_unlink_write_queue(skb, sk);
+		if (tcp_write_queue_empty(sk))
+			tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
+		sk_wmem_free_skb(sk, skb);
+	}
+}
+
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			 size_t size, int flags)
 {
@@ -1064,6 +1080,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 	return copied;
 
 do_error:
+	tcp_remove_empty_skb(sk, tcp_write_queue_tail(sk));
 	if (copied)
 		goto out;
 out_err:
@@ -1388,18 +1405,11 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 	sock_zerocopy_put(uarg);
 	return copied + copied_syn;
 
+do_error:
+	skb = tcp_write_queue_tail(sk);
 do_fault:
-	if (!skb->len) {
-		tcp_unlink_write_queue(skb, sk);
-		/* It is the one place in all of TCP, except connection
-		 * reset, where we can be unlinking the send_head.
-		 */
-		if (tcp_write_queue_empty(sk))
-			tcp_chrono_stop(sk, TCP_CHRONO_BUSY);
-		sk_wmem_free_skb(sk, skb);
-	}
+	tcp_remove_empty_skb(sk, skb);
 
-do_error:
 	if (copied + copied_syn)
 		goto out;
 out_err:
-- 
2.23.0.187.g17f5b7556c-goog


^ permalink raw reply related

* Re: [PATCH bpf] nfp: bpf: fix latency bug when updating stack index register
From: Alexei Starovoitov @ 2019-08-26 16:18 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Song Liu, Daniel Borkmann, bpf, Networking, OSS Drivers,
	Jiong Wang
In-Reply-To: <CAJpBn1z736w5_uv7apwyy82vzcnc9c5Gua_9ZyUy-pSEwnQewA@mail.gmail.com>

On Mon, Aug 26, 2019 at 8:57 AM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Sun, Aug 25, 2019 at 10:37 PM Song Liu <liu.song.a23@gmail.com> wrote:
> > On Fri, Aug 23, 2019 at 7:04 PM Jakub Kicinski wrote:
> > > From: Jiong Wang <jiong.wang@netronome.com>
> > >
> > > NFP is using Local Memory to model stack. LM_addr could be used as base of
> > > a 16 32-bit word region of Local Memory. Then, if the stack offset is
> > > beyond the current region, the local index needs to be updated. The update
> > > needs at least three cycles to take effect, therefore the sequence normally
> > > looks like:
> > >
> > >   local_csr_wr[ActLMAddr3, gprB_5]
> > >   nop
> > >   nop
> > >   nop
> > >
> > > If the local index switch happens on a narrow loads, then the instruction
> > > preparing value to zero high 32-bit of the destination register could be
> > > counted as one cycle, the sequence then could be something like:
> > >
> > >   local_csr_wr[ActLMAddr3, gprB_5]
> > >   nop
> > >   nop
> > >   immed[gprB_5, 0]
> > >
> > > However, we have zero extension optimization that zeroing high 32-bit could
> > > be eliminated, therefore above IMMED insn won't be available for which case
> > > the first sequence needs to be generated.
> > >
> > > Fixes: 0b4de1ff19bf ("nfp: bpf: eliminate zero extension code-gen")
> > > Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> > > Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> > I haven't looked into the code yet. But ^^^ should be
> >
> > Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> >
> > right?
>
> I prefer Review on code I review, ack on code I ack, and sign-off on
> code I co-author.

I believe if you're sending somebody else patch you have to add your SOB
in addition to their 'Author:' and their SOB fields.

^ permalink raw reply

* Re: [PATCH v2 net-next 2/2] net: dsa: tag_8021q: Restore bridge VLANs when enabling vlan_filtering
From: Vladimir Oltean @ 2019-08-26 16:13 UTC (permalink / raw)
  To: Vivien Didelot
  Cc: Florian Fainelli, Andrew Lunn, Ido Schimmel, Roopa Prabhu,
	nikolay, David S. Miller, netdev
In-Reply-To: <20190826112049.GB27025@t480s.localdomain>

Hi Vivien,

On Mon, 26 Aug 2019 at 18:20, Vivien Didelot <vivien.didelot@gmail.com> wrote:
>
> Hi Vladimir,
>
> On Sun, 25 Aug 2019 21:44:54 +0300, Vladimir Oltean <olteanv@gmail.com> wrote:
> > -     if (enabled)
> > -             err = dsa_port_vid_add(upstream_dp, tx_vid, 0);
> > -     else
> > -             err = dsa_port_vid_del(upstream_dp, tx_vid);
> > +     err = dsa_8021q_vid_apply(ds, upstream, tx_vid, 0, enabled);
> >       if (err) {
> >               dev_err(ds->dev, "Failed to apply TX VID %d on port %d: %d\n",
> >                       tx_vid, upstream, err);
> >               return err;
> >       }
> >
> > -     return 0;
> > +     if (!enabled)
> > +             err = dsa_8021q_restore_pvid(ds, port);
> > +
> > +     return err;
> >  }
>
> I did not dig that much into tag_8021q.c yet. From seeing this portion,
> I'm just wondering if these two helpers couldn't be part of the same logic
> as they both act upon the "enabled" condition?
>
> Otherwise I have no complains about the series.
>

I thought too about trying to merge the 2 into the same function (not
a lot, though).
But consider that they do different things in the "!enabled" case:
- dsa_8021q_vid_apply: check if this specific vid (provided as
argument) was installed in the bridge, and if so, restore it
- dsa_8021q_restore_pvid: search for the bridge port's pvid, and restore that
I don't think that the end result will look cleaner if I merge these 2 things.

>
> Thanks,
>
>         Vivien

Thanks,
-Vladimir

^ permalink raw reply

* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jiri Pirko @ 2019-08-26 16:09 UTC (permalink / raw)
  To: David Ahern
  Cc: Roopa Prabhu, netdev, David Miller, Jakub Kicinski,
	Stephen Hemminger, dcbw, Michal Kubecek, Andrew Lunn, parav,
	Saeed Mahameed, mlxsw
In-Reply-To: <20190813065617.GK2428@nanopsycho>

Tue, Aug 13, 2019 at 08:56:17AM CEST, jiri@resnulli.us wrote:
>Mon, Aug 12, 2019 at 06:01:59PM CEST, dsahern@gmail.com wrote:
>>On 8/12/19 2:31 AM, Jiri Pirko wrote:
>>> Mon, Aug 12, 2019 at 03:37:26AM CEST, dsahern@gmail.com wrote:
>>>> On 8/11/19 7:34 PM, David Ahern wrote:
>>>>> On 8/10/19 12:30 AM, Jiri Pirko wrote:
>>>>>> Could you please write me an example message of add/remove?
>>>>>
>>>>> altnames are for existing netdevs, yes? existing netdevs have an id and
>>>>> a name - 2 existing references for identifying the existing netdev for
>>>>> which an altname will be added. Even using the altname as the main
>>>>> 'handle' for a setlink change, I see no reason why the GETLINK api can
>>>>> not take an the IFLA_ALT_IFNAME and return the full details of the
>>>>> device if the altname is unique.
>>>>>
>>>>> So, what do the new RTM commands give you that you can not do with
>>>>> RTM_*LINK?
>>>>>
>>>>
>>>>
>>>> To put this another way, the ALT_NAME is an attribute of an object - a
>>>> LINK. It is *not* a separate object which requires its own set of
>>>> commands for manipulating.
>>> 
>>> Okay, again, could you provide example of a message to add/remove
>>> altname using existing setlink message? Thanks!
>>> 
>>
>>Examples from your cover letter with updates
>>
>>$ ip link set dummy0 altname someothername
>>$ ip link set dummy0 altname someotherveryveryveryverylongname
>>
>>$ ip link set dummy0 del altname someothername
>>$ ip link set dummy0 del altname someotherveryveryveryverylongname
>>
>>This syntactic sugar to what is really happening:
>>
>>RTM_NEWLINK, dummy0, IFLA_ALT_IFNAME
>>
>>if you are allowing many alt names, then yes, you need a flag to say
>>delete this specific one which is covered by Roopa's nested suggestion.
>
>Yeah, so you need and op inside the message. We are on the same page,
>thanks.

DaveA, Roopa. Do you insist on doing add/remove of altnames in the
existing setlist command using embedded message op attrs? I'm asking
because after some time thinking about it, it still feels wrong to me :/

If this would be a generic netlink api, we would just add another couple
of commands. What is so different we can't add commands here?
It is also much simpler code. Easy error handling, no need for
rollback, no possibly inconsistent state, etc.


^ permalink raw reply

* Re: [PATCH net-next v3 4/6] net: dsa: mv88e6xxx: simplify SERDES code for Topaz and Peridot
From: Vivien Didelot @ 2019-08-26 16:08 UTC (permalink / raw)
  To: Marek Behun; +Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190825183609.4a9cc0d7@nic.cz>

Hi Marek,

On Sun, 25 Aug 2019 18:36:09 +0200, Marek Behun <marek.behun@nic.cz> wrote:
> > Aren't you relying on -ENODEV as well?
> 
> Vivien, I am not relying o -ENODEV. I changed the serdes_get_lane
> semantics:
>  - previously:
>    - if port has a lane for current cmode, return given lane number
>    - otherwise return -ENODEV
>    - if other error occured during serdes_get_lane, return that error
>      (this never happened, because all implementations only need port
>      number and cmode, and cmode is cached, so no function was called
>      that could err)
>  - after this commit:
>    - if port has a lane for current cmode, return 0 and put lane number
>      into *lane
>    - otherwise return 0 and put -1 into *lane
>    - if error occured, return that error number
> 
> I removed the -ENODEV semantics for "no lane on port" event.
> There are two reasons for this:
>   1. once you requested lane number to be put into a place pointed to
>      by a pointer, rather than the return value, the code seemed better
>      to me (you may of course disagree, this is a personal opinion) when
>      I did:
>        if (err)
>            return err;
>        if (lane < 0)
>            return 0;
>      rather than
>        if (err == -ENODEV)
>            return 0;
>        if (err)
>            return err;

A single return path for invalid queries, eventually checking a specific
error, is always more idiomatic and better than checking two places which
could lead in mistakes as your previous patch did. So this is more readable:

    if (err)
        return err;

or:

    if (err && err != -ENODEV)
        return err;

or:

    if (err) {
        if (err = -ENODEV)
            err = 0;
        return err;
    }

>   2. some future implementation may actually need to call some MDIO
>      read/write functions, which may or may not return -ENODEV. That
>      could conflict with the -ENODEV returned when there is no lane.

The current code is already using -ENODEV to inform about "no lane for port",
even if it can be used by lower level functions, same as -EINVAL. That is fine.

So if you have to respin the series again, I would really prefer to see an
unsigned lane parameter, otherwise, fine...


Thanks,

	Vivien

^ permalink raw reply

* Re: [PATCH bpf] nfp: bpf: fix latency bug when updating stack index register
From: Jakub Kicinski @ 2019-08-26 15:57 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, Daniel Borkmann, bpf, Networking, OSS Drivers,
	Jiong Wang
In-Reply-To: <CAPhsuW7_dSEPJOdKApQFU-aVmEXgOwmqLS7S1FC4JtnzjR6OiQ@mail.gmail.com>

On Sun, Aug 25, 2019 at 10:37 PM Song Liu <liu.song.a23@gmail.com> wrote:
> On Fri, Aug 23, 2019 at 7:04 PM Jakub Kicinski wrote:
> > From: Jiong Wang <jiong.wang@netronome.com>
> >
> > NFP is using Local Memory to model stack. LM_addr could be used as base of
> > a 16 32-bit word region of Local Memory. Then, if the stack offset is
> > beyond the current region, the local index needs to be updated. The update
> > needs at least three cycles to take effect, therefore the sequence normally
> > looks like:
> >
> >   local_csr_wr[ActLMAddr3, gprB_5]
> >   nop
> >   nop
> >   nop
> >
> > If the local index switch happens on a narrow loads, then the instruction
> > preparing value to zero high 32-bit of the destination register could be
> > counted as one cycle, the sequence then could be something like:
> >
> >   local_csr_wr[ActLMAddr3, gprB_5]
> >   nop
> >   nop
> >   immed[gprB_5, 0]
> >
> > However, we have zero extension optimization that zeroing high 32-bit could
> > be eliminated, therefore above IMMED insn won't be available for which case
> > the first sequence needs to be generated.
> >
> > Fixes: 0b4de1ff19bf ("nfp: bpf: eliminate zero extension code-gen")
> > Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
> > Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> I haven't looked into the code yet. But ^^^ should be
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>
> right?

I prefer Review on code I review, ack on code I ack, and sign-off on
code I co-author.

^ permalink raw reply

* Re: [PATCH] samples: bpf: add max_pckt_size option at xdp_adjust_tail
From: Maciej Fijalkowski @ 2019-08-26 15:54 UTC (permalink / raw)
  To: Daniel T. Lee; +Cc: Daniel Borkmann, Alexei Starovoitov, netdev
In-Reply-To: <20190826095722.28229-1-danieltimlee@gmail.com>

On Mon, 26 Aug 2019 18:57:22 +0900
"Daniel T. Lee" <danieltimlee@gmail.com> wrote:

> Currently, at xdp_adjust_tail_kern.c, MAX_PCKT_SIZE is limited
> to 600. To make this size flexible, a new map 'pcktsz' is added.
> 
> By updating new packet size to this map from the userland,
> xdp_adjust_tail_kern.o will use this value as a new max_pckt_size.
> 
> If no '-P <MAX_PCKT_SIZE>' option is used, the size of maximum packet
> will be 600 as a default.
> 
> Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
> ---
>  samples/bpf/xdp_adjust_tail_kern.c | 23 +++++++++++++++++++----
>  samples/bpf/xdp_adjust_tail_user.c | 21 +++++++++++++++++++--
>  2 files changed, 38 insertions(+), 6 deletions(-)
> 
> diff --git a/samples/bpf/xdp_adjust_tail_kern.c b/samples/bpf/xdp_adjust_tail_kern.c
> index 411fdb21f8bc..4d53af370b68 100644
> --- a/samples/bpf/xdp_adjust_tail_kern.c
> +++ b/samples/bpf/xdp_adjust_tail_kern.c
> @@ -25,6 +25,13 @@
>  #define ICMP_TOOBIG_SIZE 98
>  #define ICMP_TOOBIG_PAYLOAD_SIZE 92
>  
> +struct bpf_map_def SEC("maps") pcktsz = {
> +	.type = BPF_MAP_TYPE_ARRAY,
> +	.key_size = sizeof(__u32),
> +	.value_size = sizeof(__u32),
> +	.max_entries = 1,
> +};
> +
>  struct bpf_map_def SEC("maps") icmpcnt = {
>  	.type = BPF_MAP_TYPE_ARRAY,
>  	.key_size = sizeof(__u32),
> @@ -64,7 +71,8 @@ static __always_inline void ipv4_csum(void *data_start, int data_size,
>  	*csum = csum_fold_helper(*csum);
>  }
>  
> -static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
> +static __always_inline int send_icmp4_too_big(struct xdp_md *xdp,
> +					      __u32 max_pckt_size)
>  {
>  	int headroom = (int)sizeof(struct iphdr) + (int)sizeof(struct icmphdr);
>  
> @@ -92,7 +100,7 @@ static __always_inline int send_icmp4_too_big(struct xdp_md *xdp)
>  	orig_iph = data + off;
>  	icmp_hdr->type = ICMP_DEST_UNREACH;
>  	icmp_hdr->code = ICMP_FRAG_NEEDED;
> -	icmp_hdr->un.frag.mtu = htons(MAX_PCKT_SIZE-sizeof(struct ethhdr));
> +	icmp_hdr->un.frag.mtu = htons(max_pckt_size - sizeof(struct ethhdr));
>  	icmp_hdr->checksum = 0;
>  	ipv4_csum(icmp_hdr, ICMP_TOOBIG_PAYLOAD_SIZE, &csum);
>  	icmp_hdr->checksum = csum;
> @@ -118,14 +126,21 @@ static __always_inline int handle_ipv4(struct xdp_md *xdp)
>  {
>  	void *data_end = (void *)(long)xdp->data_end;
>  	void *data = (void *)(long)xdp->data;
> +	__u32 max_pckt_size = MAX_PCKT_SIZE;
> +	__u32 *pckt_sz;
> +	__u32 key = 0;
>  	int pckt_size = data_end - data;
>  	int offset;
>  
> -	if (pckt_size > MAX_PCKT_SIZE) {
> +	pckt_sz = bpf_map_lookup_elem(&pcktsz, &key);
> +	if (pckt_sz && *pckt_sz)
> +		max_pckt_size = *pckt_sz;
> +
> +	if (pckt_size > max_pckt_size) {
>  		offset = pckt_size - ICMP_TOOBIG_SIZE;
>  		if (bpf_xdp_adjust_tail(xdp, 0 - offset))
>  			return XDP_PASS;
> -		return send_icmp4_too_big(xdp);
> +		return send_icmp4_too_big(xdp, max_pckt_size);
>  	}
>  	return XDP_PASS;
>  }
> diff --git a/samples/bpf/xdp_adjust_tail_user.c b/samples/bpf/xdp_adjust_tail_user.c
> index a3596b617c4c..dd3befa5e1fe 100644
> --- a/samples/bpf/xdp_adjust_tail_user.c
> +++ b/samples/bpf/xdp_adjust_tail_user.c
> @@ -72,6 +72,7 @@ static void usage(const char *cmd)
>  	printf("Usage: %s [...]\n", cmd);
>  	printf("    -i <ifname|ifindex> Interface\n");
>  	printf("    -T <stop-after-X-seconds> Default: 0 (forever)\n");
> +	printf("    -P <MAX_PCKT_SIZE> Default: 600\n");
>  	printf("    -S use skb-mode\n");
>  	printf("    -N enforce native mode\n");
>  	printf("    -F force loading prog\n");
> @@ -85,9 +86,11 @@ int main(int argc, char **argv)
>  		.prog_type	= BPF_PROG_TYPE_XDP,
>  	};
>  	unsigned char opt_flags[256] = {};
> -	const char *optstr = "i:T:SNFh";
> +	const char *optstr = "i:T:P:SNFh";
>  	struct bpf_prog_info info = {};
>  	__u32 info_len = sizeof(info);
> +	__u32 max_pckt_size = 0;
> +	__u32 key = 0;
>  	unsigned int kill_after_s = 0;
>  	int i, prog_fd, map_fd, opt;
>  	struct bpf_object *obj;
> @@ -110,6 +113,9 @@ int main(int argc, char **argv)
>  		case 'T':
>  			kill_after_s = atoi(optarg);
>  			break;
> +		case 'P':
> +			max_pckt_size = atoi(optarg);
> +			break;
>  		case 'S':
>  			xdp_flags |= XDP_FLAGS_SKB_MODE;
>  			break;
> @@ -150,9 +156,20 @@ int main(int argc, char **argv)
>  	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
>  		return 1;
>  
> +	/* update pcktsz map */
>  	map = bpf_map__next(NULL, obj);
>  	if (!map) {
> -		printf("finding a map in obj file failed\n");
> +		printf("finding a pcktsz map in obj file failed\n");
> +		return 1;
> +	}
> +	map_fd = bpf_map__fd(map);

Consider using bpf_object__find_map_fd_by_name() here.

> +	if (max_pckt_size)
> +		bpf_map_update_elem(map_fd, &key, &max_pckt_size, BPF_ANY);
> +
> +	/* fetch icmpcnt map */
> +	map = bpf_map__next(map, obj);
> +	if (!map) {
> +		printf("finding a icmpcnt map in obj file failed\n");
>  		return 1;
>  	}
>  	map_fd = bpf_map__fd(map);


^ permalink raw reply

* Re: [PATCH net-next v3 04/10] net: sched: notify classifier on successful offload add/delete
From: Jiri Pirko @ 2019-08-26 15:51 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, davem, jakub.kicinski, pablo
In-Reply-To: <20190826134506.9705-5-vladbu@mellanox.com>

Mon, Aug 26, 2019 at 03:45:00PM CEST, vladbu@mellanox.com wrote:
>To remove dependency on rtnl lock, extend classifier ops with new
>ops->hw_add() and ops->hw_del() callbacks. Call them from cls API while
>holding cb_lock every time filter if successfully added to or deleted from
>hardware.
>
>Implement the new API in flower classifier. Use it to manage hw_filters
>list under cb_lock protection, instead of relying on rtnl lock to
>synchronize with concurrent fl_reoffload() call.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next v3 03/10] net: sched: refactor block offloads counter usage
From: Jiri Pirko @ 2019-08-26 15:46 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, davem, jakub.kicinski, pablo
In-Reply-To: <20190826134506.9705-4-vladbu@mellanox.com>

Mon, Aug 26, 2019 at 03:44:59PM CEST, vladbu@mellanox.com wrote:
>Without rtnl lock protection filters can no longer safely manage block
>offloads counter themselves. Refactor cls API to protect block offloadcnt
>with tcf_block->cb_lock that is already used to protect driver callback
>list and nooffloaddevcnt counter. The counter can be modified by concurrent
>tasks by new functions that execute block callbacks (which is safe with
>previous patch that changed its type to atomic_t), however, block
>bind/unbind code that checks the counter value takes cb_lock in write mode
>to exclude any concurrent modifications. This approach prevents race
>conditions between bind/unbind and callback execution code but allows for
>concurrency for tc rule update path.
>
>Move block offload counter, filter in hardware counter and filter flags
>management from classifiers into cls hardware offloads API. Make functions
>tcf_block_offload_{inc|dec}() and tc_cls_offload_cnt_update() to be cls API
>private. Implement following new cls API to be used instead:
>
>  tc_setup_cb_add() - non-destructive filter add. If filter that wasn't
>  already in hardware is successfully offloaded, increment block offloads
>  counter, set filter in hardware counter and flag. On failure, previously
>  offloaded filter is considered to be intact and offloads counter is not
>  decremented.
>
>  tc_setup_cb_replace() - destructive filter replace. Release existing
>  filter block offload counter and reset its in hardware counter and flag.
>  Set new filter in hardware counter and flag. On failure, previously
>  offloaded filter is considered to be destroyed and offload counter is
>  decremented.
>
>  tc_setup_cb_destroy() - filter destroy. Unconditionally decrement block
>  offloads counter.
>
>  tc_setup_cb_reoffload() - reoffload filter to single cb. Execute cb() and
>  call tc_cls_offload_cnt_update() if cb() didn't return an error.
>
>Refactor all offload-capable classifiers to atomically offload filters to
>hardware, change block offload counter, and set filter in hardware counter
>and flag by means of the new cls API functions.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next v3 01/10] net: sched: protect block offload-related fields with rw_semaphore
From: Jiri Pirko @ 2019-08-26 15:39 UTC (permalink / raw)
  To: Vlad Buslov; +Cc: netdev, jhs, xiyou.wangcong, davem, jakub.kicinski, pablo
In-Reply-To: <20190826134506.9705-2-vladbu@mellanox.com>

Mon, Aug 26, 2019 at 03:44:57PM CEST, vladbu@mellanox.com wrote:
>In order to remove dependency on rtnl lock, extend tcf_block with 'cb_lock'
>rwsem and use it to protect flow_block->cb_list and related counters from
>concurrent modification. The lock is taken in read mode for read-only
>traversal of cb_list in tc_setup_cb_call() and write mode in all other
>cases. This approach ensures that:
>
>- cb_list is not changed concurrently while filters is being offloaded on
>  block.
>
>- block->nooffloaddevcnt is checked while holding the lock in read mode,
>  but is only changed by bind/unbind code when holding the cb_lock in write
>  mode to prevent concurrent modification.
>
>Signed-off-by: Vlad Buslov <vladbu@mellanox.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [PATCH net-next v4 6/6] net: dsa: mv88e6xxx: fully support SERDES on Topaz family
From: Andrew Lunn @ 2019-08-26 15:38 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Vivien Didelot, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826122109.20660-7-marek.behun@nic.cz>

> +static int mv88e6xxx_port_set_cmode(struct mv88e6xxx_chip *chip, int port,
> +				    phy_interface_t mode, bool allow_over_2500,
> +				    bool make_cmode_writable)

I don't like these two parameters. The caller of this function can do
the check for allow_over_2500 and error out before calling this.

Is make_cmode_writable something that could be done once at probe and
then forgotten about? Or is it needed before every write? At least
move it into the specific port_set_cmode() that requires it.

Thanks
	Andrew

^ permalink raw reply

* Re: [PATCH net-next v4 4/6] net: dsa: mv88e6xxx: simplify SERDES code for Topaz and Peridot
From: Andrew Lunn @ 2019-08-26 15:29 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Vivien Didelot, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826122109.20660-5-marek.behun@nic.cz>

On Mon, Aug 26, 2019 at 02:21:07PM +0200, Marek Behún wrote:
> By adding an additional serdes_get_lane implementation (for Topaz), we
> can merge the implementations of other SERDES functions (powering and
> IRQs). We can skip checking port numbers, since the serdes_get_lane()
> methods inform if there is no lane on a port or if the lane cannot be
> used for given cmode.
> 
> Signed-off-by: Marek Behún <marek.behun@nic.cz>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH net-next v4 3/6] net: dsa: mv88e6xxx: create serdes_get_lane chip operation
From: Andrew Lunn @ 2019-08-26 15:28 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Vivien Didelot, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826122109.20660-4-marek.behun@nic.cz>

> -int mv88e6390x_serdes_get_lane(struct mv88e6xxx_chip *chip, int port)
> +int mv88e6390x_serdes_get_lane(struct mv88e6xxx_chip *chip, int port, s8 *lane)
>  {
>  	u8 cmode_port9, cmode_port10, cmode_port;
>  
> @@ -323,76 +320,80 @@ int mv88e6390x_serdes_get_lane(struct mv88e6xxx_chip *chip, int port)
>  	cmode_port10 = chip->ports[10].cmode;
>  	cmode_port = chip->ports[port].cmode;
>  
> +	*lane = -1;
> +

You could move that into mv88e8xxx_serdes_get_lane().

Otherwise

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH bpf-next v2 2/4] xsk: add proper barriers and {READ, WRITE}_ONCE-correctness for state
From: Ilya Maximets @ 2019-08-26 15:24 UTC (permalink / raw)
  To: Björn Töpel, ast, daniel, netdev
  Cc: Björn Töpel, magnus.karlsson, magnus.karlsson, bpf,
	jonathan.lemon, syzbot+c82697e3043781e08802, hdanton
In-Reply-To: <20190826061053.15996-3-bjorn.topel@gmail.com>

On 26.08.2019 9:10, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> The state variable was read, and written outside the control mutex
> (struct xdp_sock, mutex), without proper barriers and {READ,
> WRITE}_ONCE correctness.
> 
> In this commit this issue is addressed, and the state member is now
> used a point of synchronization whether the socket is setup correctly
> or not.
> 
> This also fixes a race, found by syzcaller, in xsk_poll() where umem
> could be accessed when stale.
> 
> Suggested-by: Hillf Danton <hdanton@sina.com>
> Reported-by: syzbot+c82697e3043781e08802@syzkaller.appspotmail.com
> Fixes: 77cd0d7b3f25 ("xsk: add support for need_wakeup flag in AF_XDP rings")
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  net/xdp/xsk.c | 57 ++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 40 insertions(+), 17 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index f3351013c2a5..8fafa3ce3ae6 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -162,10 +162,23 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>  	return err;
>  }
>  
> +static bool xsk_is_bound(struct xdp_sock *xs)
> +{
> +	if (READ_ONCE(xs->state) == XSK_BOUND) {
> +		/* Matches smp_wmb() in bind(). */
> +		smp_rmb();
> +		return true;
> +	}
> +	return false;
> +}
> +
>  int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>  {
>  	u32 len;
>  
> +	if (!xsk_is_bound(xs))
> +		return -EINVAL;
> +
>  	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
>  		return -EINVAL;
>  
> @@ -362,7 +375,7 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
>  	struct sock *sk = sock->sk;
>  	struct xdp_sock *xs = xdp_sk(sk);
>  
> -	if (unlikely(!xs->dev))
> +	if (unlikely(!xsk_is_bound(xs)))
>  		return -ENXIO;
>  	if (unlikely(!(xs->dev->flags & IFF_UP)))
>  		return -ENETDOWN;
> @@ -378,10 +391,15 @@ static unsigned int xsk_poll(struct file *file, struct socket *sock,
>  			     struct poll_table_struct *wait)
>  {
>  	unsigned int mask = datagram_poll(file, sock, wait);
> -	struct sock *sk = sock->sk;
> -	struct xdp_sock *xs = xdp_sk(sk);
> -	struct net_device *dev = xs->dev;
> -	struct xdp_umem *umem = xs->umem;
> +	struct xdp_sock *xs = xdp_sk(sock->sk);
> +	struct net_device *dev;
> +	struct xdp_umem *umem;
> +
> +	if (unlikely(!xsk_is_bound(xs)))
> +		return mask;
> +
> +	dev = xs->dev;
> +	umem = xs->umem;
>  
>  	if (umem->need_wakeup)
>  		dev->netdev_ops->ndo_xsk_wakeup(dev, xs->queue_id,
> @@ -417,10 +435,9 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
>  {
>  	struct net_device *dev = xs->dev;
>  
> -	if (!dev || xs->state != XSK_BOUND)
> +	if (xs->state != XSK_BOUND)
>  		return;
> -
> -	xs->state = XSK_UNBOUND;
> +	WRITE_ONCE(xs->state, XSK_UNBOUND);
>  
>  	/* Wait for driver to stop using the xdp socket. */
>  	xdp_del_sk_umem(xs->umem, xs);
> @@ -495,7 +512,9 @@ static int xsk_release(struct socket *sock)
>  	local_bh_enable();
>  
>  	xsk_delete_from_maps(xs);
> +	mutex_lock(&xs->mutex);
>  	xsk_unbind_dev(xs);
> +	mutex_unlock(&xs->mutex);
>  
>  	xskq_destroy(xs->rx);
>  	xskq_destroy(xs->tx);
> @@ -589,19 +608,18 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>  		}
>  
>  		umem_xs = xdp_sk(sock->sk);
> -		if (!umem_xs->umem) {
> -			/* No umem to inherit. */
> +		if (!xsk_is_bound(umem_xs)) {

This changes the error code a bit.
Previously:
   umem exists + xs unbound    --> EINVAL
   no umem     + xs unbound    --> EBADF
   xs bound to different dev/q --> EINVAL

With this change:
   umem exists + xs unbound    --> EBADF
   no umem     + xs unbound    --> EBADF
   xs bound to different dev/q --> EINVAL

Just a note. Not sure if this is important.


>  			err = -EBADF;
>  			sockfd_put(sock);
>  			goto out_unlock;
> -		} else if (umem_xs->dev != dev || umem_xs->queue_id != qid) {
> +		}
> +		if (umem_xs->dev != dev || umem_xs->queue_id != qid) {
>  			err = -EINVAL;
>  			sockfd_put(sock);
>  			goto out_unlock;
>  		}
> -
>  		xdp_get_umem(umem_xs->umem);
> -		xs->umem = umem_xs->umem;
> +		WRITE_ONCE(xs->umem, umem_xs->umem);
>  		sockfd_put(sock);
>  	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
>  		err = -EINVAL;
> @@ -626,10 +644,15 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
>  	xdp_add_sk_umem(xs->umem, xs);
>  
>  out_unlock:
> -	if (err)
> +	if (err) {
>  		dev_put(dev);
> -	else
> -		xs->state = XSK_BOUND;
> +	} else {
> +		/* Matches smp_rmb() in bind() for shared umem
> +		 * sockets, and xsk_is_bound().
> +		 */
> +		smp_wmb();
> +		WRITE_ONCE(xs->state, XSK_BOUND);
> +	}
>  out_release:
>  	mutex_unlock(&xs->mutex);
>  	rtnl_unlock();
> @@ -869,7 +892,7 @@ static int xsk_mmap(struct file *file, struct socket *sock,
>  	unsigned long pfn;
>  	struct page *qpg;
>  
> -	if (xs->state != XSK_READY)
> +	if (READ_ONCE(xs->state) != XSK_READY)
>  		return -EBUSY;
>  
>  	if (offset == XDP_PGOFF_RX_RING) {
> 

^ permalink raw reply

* Re: [PATCH net-next v4 2/6] net: dsa: mv88e6xxx: update code operating on hidden registers
From: Andrew Lunn @ 2019-08-26 15:20 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Vivien Didelot, Florian Fainelli, Vladimir Oltean
In-Reply-To: <20190826122109.20660-3-marek.behun@nic.cz>

On Mon, Aug 26, 2019 at 02:21:05PM +0200, Marek Behún wrote:
> This patch moves the functions operating on the hidden debug registers
> into it's own file, port_hidden.c. The functions prefix is renamed from
> mv88e6390_hidden_ to mv88e6xxx_port_hidden_, to be consistent with the
> rest of this driver. The macros are prefixed with MV88E6XXX_ prefix, and
> are changed not to use the BIT() macro nor bit shifts, since the rest of
> the port.h file does not use it.
> 
> We also add the support for setting the Block Address field when
> operating hidden registers. Marvell's mdio examples for SERDES settings
> on Topaz use Block Address 0x7 when reading/writing hidden registers,
> and although the specification says that block must be set to 0xf, those
> settings are reachable only with Block Address 0x7.
> 
> Signed-off-by: Marek Behún <marek.behun@nic.cz>

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH v2 net-next 2/2] net: dsa: tag_8021q: Restore bridge VLANs when enabling vlan_filtering
From: Vivien Didelot @ 2019-08-26 15:20 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: f.fainelli, andrew, idosch, roopa, nikolay, davem, netdev,
	Vladimir Oltean
In-Reply-To: <20190825184454.14678-3-olteanv@gmail.com>

Hi Vladimir,

On Sun, 25 Aug 2019 21:44:54 +0300, Vladimir Oltean <olteanv@gmail.com> wrote:
> -	if (enabled)
> -		err = dsa_port_vid_add(upstream_dp, tx_vid, 0);
> -	else
> -		err = dsa_port_vid_del(upstream_dp, tx_vid);
> +	err = dsa_8021q_vid_apply(ds, upstream, tx_vid, 0, enabled);
>  	if (err) {
>  		dev_err(ds->dev, "Failed to apply TX VID %d on port %d: %d\n",
>  			tx_vid, upstream, err);
>  		return err;
>  	}
>  
> -	return 0;
> +	if (!enabled)
> +		err = dsa_8021q_restore_pvid(ds, port);
> +
> +	return err;
>  }

I did not dig that much into tag_8021q.c yet. From seeing this portion,
I'm just wondering if these two helpers couldn't be part of the same logic
as they both act upon the "enabled" condition?

Otherwise I have no complains about the series.


Thanks,

	Vivien

^ permalink raw reply

* Re: kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call without frame pointer save/setup
From: Josh Poimboeuf @ 2019-08-26 15:18 UTC (permalink / raw)
  To: He Zhe
  Cc: ast, daniel, kafai, songliubraving, yhs, ndesaulniers,
	miguel.ojeda.sandonis, luc.vanoostenryck, schwidefsky, gregkh,
	mst, gor, andreyknvl, liuxiaozhou, yamada.masahiro, linux-kernel,
	netdev, bpf
In-Reply-To: <cf0273fb-c272-72be-50f9-b25bb7c7f183@windriver.com>

On Mon, Aug 26, 2019 at 10:42:53PM +0800, He Zhe wrote:
> Hi All,
> 
> Since 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()"),
> We have got the following warning,
> kernel/bpf/core.o: warning: objtool: ___bpf_prog_run.cold()+0x7: call without frame pointer save/setup
> 
> If reverting the above commit, we will get the following warning,
> kernel/bpf/core.o: warning: objtool: ___bpf_prog_run()+0x8b9: sibling call from callable instruction with modified stack frame
> if CONFIG_RETPOLINE=n, and no warning if CONFIG_RETPOLINE=y

Can you please share the following:

- core.o file

The following would also be helpful for me to try to recreate it:

- config file
- compiler version
- kernel version

-- 
Josh

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox