Netdev List
* Re: [PATCH bpf-next 3/3] bpftool: support loading flow dissector
From: Stanislav Fomichev @ 2018-11-07 23:34 UTC
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	quentin.monnet, guro, jiong.wang, bhole_prashant_q7,
	john.fastabend, jbenc, treeze.taeung, yhs, osk, sandipan
In-Reply-To: <20181107152917.729ea83c@cakuba.netronome.com>

On 11/07, Jakub Kicinski wrote:
> On Wed, 7 Nov 2018 15:13:33 -0800, Stanislav Fomichev wrote:
> > On 11/07, Jakub Kicinski wrote:
> > > On Wed,  7 Nov 2018 14:43:56 -0800, Stanislav Fomichev wrote:  
> > > > bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
> > > >         key 0 0 0 0 \
> > > >         value pinned /sys/fs/bpf/flow/IP/0  
> > > 
> > > Where is that /0 coming from ?  Is that in source code?  I don't see
> > > libbpf adding it, maybe I'm missing something.  
> > libbpf adds that, that's a program instance:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/bpf/libbpf.c#n1744
> 
> Ugh, I was looking at bpf_object__pin() which uses names :(
> 
> We never use this multi-instance thing, and I don't think bpftool ever
> will, so IMHO it'd be good if we just re-did the pinning loop in
> bpftool.
I wonder whether I should just add special case to bpf_program__pin: don't
create a subdir when instances.nr == 1 (and just create a file pin for
single instance)? In that case I can continue to use libbpf and don't reinvent
the wheel. Any objections?


* Re: net: ipv4: tcp_westwood: fixed warnings and checks
From: Suraj Singh @ 2018-11-08  9:08 UTC
  To: davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, Suraj Singh
In-Reply-To: <1541425985-31869-1-git-send-email-suraj1998@gmail.com--annotate>

From: Suraj Singh <suraj998@gmail.com>

Regarding why I used "staging: " in the commit message, I was following Greg Kroah-Hartman's video on YouTube on how to submit your first patch, and in his sample commit, he'd started his commit message with "staging: ", and so I thought it was convention to do so. I'll remove that immediately. I made this same mistake in another patch that I just sent for TCP Vegas, I'll make the change there as well.

I didn't consider the complexities of calling the same function twice. I was looking more towards satisfying checkpatch.pl's requirements.

-		tp->snd_cwnd = tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk);
+		tp->snd_cwnd = tcp_westwood_bw_rttmin(sk);
+		tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk);

I've made the same mistake here. I'll make these changes and resubmit. Is there anything else that's wrong? 
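
The concern about calling the same function twice can be illustrated with a small stand-alone sketch. It is only an illustration under assumptions: struct demo_tp, demo_bw_rttmin() and demo_set_window() are hypothetical stand-ins, not the kernel's struct tcp_sock or tcp_westwood_bw_rttmin(). Storing the estimate in a local variable avoids the chained assignment that checkpatch.pl flags while still evaluating the estimator only once.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel type; not the real struct tcp_sock. */
struct demo_tp {
	uint32_t snd_cwnd;
	uint32_t snd_ssthresh;
};

static int demo_calls;	/* counts how often the estimator runs */

/* Hypothetical stand-in for tcp_westwood_bw_rttmin(). */
static uint32_t demo_bw_rttmin(void)
{
	demo_calls++;
	return 10;
}

/* Evaluate the estimate once, then store it in both fields. */
static void demo_set_window(struct demo_tp *tp)
{
	uint32_t limit = demo_bw_rttmin();

	tp->snd_cwnd = limit;
	tp->snd_ssthresh = limit;
}
```

With this shape both fields are guaranteed to hold the same value even if the estimator's inputs change between two calls.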


* Re: [PATCH bpf-next] xdp: sample code for redirecting vlan packets to specific cpus
From: Shannon Nelson @ 2018-11-07 23:37 UTC
  To: daniel
  Cc: john.fastabend, Shannon Nelson, ast, netdev, eric.dumazet,
	silviu.smarandache
In-Reply-To: <a89d3362-ced4-2851-05ce-98028f558b2f@iogearbox.net>

On Wed, Nov 7, 2018 at 1:44 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 10/29/2018 11:11 PM, John Fastabend wrote:
> > On 10/29/2018 02:19 PM, Shannon Nelson wrote:
> >> This is an example of using XDP to redirect the processing of
> >> particular vlan packets to specific CPUs.  This is in response
> >> to comments received on a kernel patch put forth previously
> >> to do something similar using RPS.
> >>      https://www.spinics.net/lists/netdev/msg528210.html
> >>      [PATCH net-next] net: enable RPS on vlan devices
> >>
> >> This XDP application watches for inbound vlan-tagged packets
> >> and redirects those packets to be processed on a specific CPU
> >> as configured in a BPF map.  The BPF map can be modified by
> >> this user program, which can also load and unload the kernel
> >> XDP code.
> >>
> >> One example use is for supporting VMs where we can't control the
> >> OS being used: we'd like to separate the VM CPU processing from
> >> the host's CPUs as a way to help mitigate L1TF related issues.
> >> When running the VM's traffic on a vlan we can stick the host's
> >> Rx processing on one set of CPUs separate from the VM's CPUs.
> >>
> >> This example currently uses a vlan key and cpu value in the
> >> BPF map, so only can do one CPU per vlan.  This could easily
> >> be modified to use a bitpattern of CPUs rather than a CPU id
> >> to allow multiple CPUs per vlan.
> >
> > Great, so does this solve your use case then? At least on drivers
> > with XDP support?
> >
> >>
> >> Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
> >> ---
> >
> > Some really small and trivial nits below.
> >
> > Acked-by: John Fastabend <john.fastabend@gmail.com>
> >
> > [...]
> >
> >> +    if (install) {
> >> +
> >
> > new line probably not needed.
> >
> >> +            /* check to see if already installed */
> >> +            errno = 0;
> >> +            access(pin_prog_name, R_OK);
> >> +            if (errno != ENOENT) {
> >> +                    fprintf(stderr, "ERR: %s is already installed\n", argv[0]);
> >> +                    return -1;
> >> +            }
> >> +
> >> +            /* load the XDP program and maps with the convenient library */
> >> +            if (load_bpf_file(filename)) {
> >> +                    fprintf(stderr, "ERR: load_bpf_file(%s): \n%s",
> >> +                            filename, bpf_log_buf);
> >> +                    return -1;
> >> +            }
> >> +            if (!prog_fd[0]) {
> >> +                    fprintf(stderr, "ERR: load_bpf_file(%s): %d %s\n",
> >> +                            filename, errno, strerror(errno));
> >> +                    return -1;
> >> +            }
> >> +
> >> +            /* pin the XDP program and maps */
> >> +            if (bpf_obj_pin(prog_fd[0], pin_prog_name) < 0) {
> >> +                    fprintf(stderr, "ERR: bpf_obj_pin(%s): %d %s\n",
> >> +                            pin_prog_name, errno, strerror(errno));
> >> +                    if (errno == 2)
> >> +                            fprintf(stderr, "     (is the BPF fs mounted on /sys/fs/bpf?)\n");
> >> +                    return -1;
> >> +            }
> >> +            if (bpf_obj_pin(map_fd[0], pin_vlanmap_name) < 0) {
> >> +                    fprintf(stderr, "ERR: bpf_obj_pin(%s): %d %s\n",
> >> +                            pin_vlanmap_name, errno, strerror(errno));
> >> +                    return -1;
> >> +            }
> >> +            if (bpf_obj_pin(map_fd[2], pin_countermap_name) < 0) {
> >> +                    fprintf(stderr, "ERR: bpf_obj_pin(%s): %d %s\n",
> >> +                            pin_countermap_name, errno, strerror(errno));
> >> +                    return -1;
> >> +            }
> >> +
> >> +            /* prep the vlan map with "not used" values */
> >> +            c64 = UNDEF_CPU;
> >> +            for (v64 = 0; v64 < 4096; v64++) {
> >
> > maybe #define MAX_VLANS 4096 just to avoid constants.
> >
> >> +                    if (bpf_map_update_elem(map_fd[0], &v64, &c64, 0)) {
> >> +                            fprintf(stderr, "ERR: prepping vlan map failed on v=%llu: %d %s\n",
> >> +                                    v64, errno, strerror(errno));
> >> +                            return -1;
> >> +                    }
> >> +            }
> >> +
> >> +            /* prep the cpumap with queue sizes */
> >> +            c64 = 128+64;  /* see note in xdp_redirect_cpu_user.c */
> >> +            for (v64 = 0; v64 < MAX_CPUS; v64++) {
> >> +                    if (bpf_map_update_elem(map_fd[1], &v64, &c64, 0)) {
> >> +                            if (errno == ENODEV) {
> >> +                                    /* Save the last CPU number attempted
> >> +                                     * into the counters map
> >> +                                     */
> >> +                                    c64 = CPU_COUNT;
> >> +                                    ret = bpf_map_update_elem(map_fd[2], &c64, &v64, 0);
> >> +                                    break;
> >> +                            }
> >> +
> >> +                            fprintf(stderr, "ERR: prepping cpu map failed on v=%llu: %d %s\n",
> >> +                                    v64, errno, strerror(errno));
> >> +                            return -1;
> >> +                    }
> >> +            }
> >> +
> >> +            /* wire the XDP program to the device */
> >> +            if (bpf_set_link_xdp_fd(ifindex, prog_fd[0], 0) < 0) {
> >> +                    fprintf(stderr, "ERR: bpf_set_link_xdp_fd(): %d %s\n",
> >> +                            errno, strerror(errno));
> >> +                    return -1;
> >> +            }
> >> +
> >> +            return 0;
> >> +    }
> >> +
> >> +    if (remove) {
> >> +
>
> Ditto
>
> >> +            /* unlink the program from the device */
> >> +            if (bpf_set_link_xdp_fd(ifindex, -1, 0) < 0)
> >> +                    fprintf(stderr, "ERR: bpf_set_link_xdp_fd(): %d %s\n",
> >> +                            errno, strerror(errno));
> >> +
> >> +            /* unlink pinned files */
> >> +            if (unlink(pin_prog_name))
> >> +                    fprintf(stderr, "ERR: unlink(%s): %d %s\n",
> >> +                            pin_prog_name, errno, strerror(errno));
> >> +            if (unlink(pin_vlanmap_name))
> >> +                    fprintf(stderr, "ERR: unlink(%s): %d %s\n",
> >> +                            pin_vlanmap_name, errno, strerror(errno));
> >> +            if (unlink(pin_countermap_name))
> >> +                    fprintf(stderr, "ERR: unlink(%s): %d %s\n",
> >> +                            pin_countermap_name, errno, strerror(errno));
> >> +
> >> +            return 0;
> >> +    }
> >> +
> >> +    if (vlan == 0) {
> >> +            fprintf(stderr, "ERR: required option --vlan missing\n");
> >> +            goto error;
> >> +    }
> >> +
> >> +    if (cpu == MAX_CPUS && vlan > 0) {
> >> +            fprintf(stderr, "ERR: required option --cpu missing\n");
> >> +            goto error;
> >> +    }
> >> +
> >> +    vfd = bpf_obj_get(pin_vlanmap_name);
> >> +    if (vfd < 0) {
> >> +            fprintf(stderr, "ERR: can't find pinned map %s: %d %s\n",
> >> +                    pin_vlanmap_name, errno, strerror(errno));
> >> +            if (errno == ENOENT)
> >> +                    fprintf(stderr, "   (has %s been installed yet?)\n", argv[0]);
> >> +            return -1;
> >> +    }
> >> +
> >> +    /* decode the requested action */
> >> +    if (vlan > 0) {
> >> +            /* check cpu against the max value found */
> >> +            cfd = bpf_obj_get(pin_countermap_name);
> >> +            if (cfd < 0) {
> >> +                    fprintf(stderr, "ERR: can't find pinned map %s: %d %s\n",
> >> +                            pin_countermap_name, errno, strerror(errno));
> >> +                    return -1;
> >> +            }
> >> +            c64 = CPU_COUNT;
> >> +            ret = bpf_map_lookup_elem(cfd, &c64, &v64);
> >> +            if (cpu >= v64) {
> >
> > Need to check ret code?
>
> Yeah, I expect this would have another respin, thanks!

Being as I no longer work for Oracle (as of this week), and I'm deep
into the spin-up of learning a new company, this won't be happening
right away.  Let's consider this an RFC, and I will try to follow up
in the "near" future.

Thanks
sln

>
> >> +                    fprintf(stderr, "ERR: cpu %d greater than max %llu\n", cpu, v64);
> >> +                    return -1;
> >> +            }
> >> +
> >> +            /* Note that the value and key pointers really do need to be
> >> +             * pointers to 64-bit values, else things get a bit muddled.
> >> +             */
> >> +            v64 = vlan;
> >> +            c64 = cpu;
> >> +            ret = bpf_map_update_elem(vfd, &v64, &c64, 0);
> >> +            if (ret) {
> >> +                    fprintf(stderr, "Adding vlan %d CPU %d failed: %d %s\n",
> >> +                            vlan, cpu, errno, strerror(errno));
> >> +                    return -1;
> >> +            }
> >> +
> >> +    } else {
> >> +            v64 = -vlan;
> >> +            c64 = UNDEF_CPU;
> >> +
> >> +            /* We can't actually delete from a TYPE_ARRAY map, so we
> >> +             * simply set it to an undefined value.
> >> +             */
> >> +            ret = bpf_map_update_elem(vfd, &v64, &c64, 0);
> >> +            if (ret) {
> >> +                    fprintf(stderr, "Delete of vlan %llu failed: %d %s\n",
> >> +                            v64, errno, strerror(errno));
> >> +                    return -1;
> >> +            }
> >> +    }
> >> +
> >> +    return 0;
> >> +
> >> +error:
> >> +    usage(argv);
> >> +    return -1;
> >> +}
> >>
> >
>
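
The review comment "Need to check ret code?" on the bpf_map_lookup_elem() call can be illustrated with a stand-alone sketch. This is an assumption-laden illustration, not the sample's code: demo_lookup() and demo_check_cpu() are hypothetical names, and the lookup is simulated rather than a real BPF map access. The point is that the looked-up maximum is only compared against the requested CPU after the lookup has reported success.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a map lookup such as bpf_map_lookup_elem():
 * returns 0 and fills *value on success, -1 if the key is absent.
 */
static int demo_lookup(int have_entry, uint64_t *value)
{
	if (!have_entry)
		return -1;
	*value = 8;	/* pretend 8 CPUs were recorded at install time */
	return 0;
}

/* Returns 0 if cpu is usable, -1 on lookup failure or out-of-range cpu. */
static int demo_check_cpu(int have_entry, uint64_t cpu)
{
	uint64_t max_cpu = 0;

	if (demo_lookup(have_entry, &max_cpu)) {
		fprintf(stderr, "ERR: counter lookup failed\n");
		return -1;	/* do not compare against an unset value */
	}
	if (cpu >= max_cpu) {
		fprintf(stderr, "ERR: cpu %llu not below max %llu\n",
			(unsigned long long)cpu, (unsigned long long)max_cpu);
		return -1;
	}
	return 0;
}
```

Without the first check, a failed lookup would leave the maximum at whatever the variable last held, and the range test would pass or fail arbitrarily.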


-- 
==============================================
Mr. Shannon Nelson         Parents can't afford to be squeamish.


* Re: [PATCH bpf-next 2/3] libbpf: cleanup after partial failure in bpf_object__pin
From: Jakub Kicinski @ 2018-11-07 23:38 UTC
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	quentin.monnet, guro, jiong.wang, bhole_prashant_q7,
	john.fastabend, jbenc, treeze.taeung, yhs, osk, sandipan
In-Reply-To: <20181107232516.qiy24ccbq3hh4ail@mini-arch>

On Wed, 7 Nov 2018 15:25:16 -0800, Stanislav Fomichev wrote:
> On 11/07, Jakub Kicinski wrote:
> > On Wed, 7 Nov 2018 15:00:21 -0800, Stanislav Fomichev wrote:  
> > > > > +err_unpin_programs:
> > > > > +	bpf_object__for_each_program(prog, obj) {
> > > > > +		char buf[PATH_MAX];
> > > > > +		int len;
> > > > > +
> > > > > +		len = snprintf(buf, PATH_MAX, "%s/%s", path,
> > > > > +			       prog->section_name);
> > > > > +		if (len < 0)
> > > > > +			continue;
> > > > > +		else if (len >= PATH_MAX)
> > > > > +			continue;
> > > > > +
> > > > > +		unlink(buf);    
> > > > 
> > > > I think that's no bueno, if pin failed because the file already exists
> > > > you'll now remove that already existing file.    
> > >
> > > How about we check beforehand and bail early if we are going to
> > > overwrite something?  
> > 
> > Possible, although the most common way to handle situation like this in
> > the kernel is to "continue the iteration in reverse" over the list.
> > I.e. walk the list back.  I think the objects are on a double linked
> > list.  You may need to add the appropriate foreach macro and helper..  
>
> That sounds more complicated than just ensuring that the top directory
> for the pins doesn't exist and then rm -rf it on failure.

Why would we require that the directory does not exist?  We can
check if it exists and then either create or just pin all in an existing
one.

I don't think it should be that much effort to write a reverse for
loop - it could actually be less LoC than that rm_rf function :)

> I'm thinking about copy-pasting rm_rf from perf
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/util.c#n119).
> Thoughts?
>
> Btw, current patch won't work because of those /0 added by bpf_program__pin.
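
The "continue the iteration in reverse" idea discussed in this message can be sketched as follows. This is a minimal stand-alone illustration under assumptions, not libbpf code: struct demo_item, demo_pin(), demo_unpin() and demo_pin_all() are hypothetical, and an array index stands in for walking a linked list backwards. On failure, only the pins created by this call are undone, so something that already existed at the failing position is left alone.

```c
#include <stddef.h>

/* Hypothetical item; libbpf's real program and map objects look different. */
struct demo_item {
	int pinned;	/* 1 if a pin exists for this item */
	int fail;	/* set to 1 to simulate a pin failure (e.g. file exists) */
};

static int demo_pin(struct demo_item *it)
{
	if (it->fail)
		return -1;
	it->pinned = 1;
	return 0;
}

static void demo_unpin(struct demo_item *it)
{
	it->pinned = 0;
}

/* Pin all items; on failure undo only the pins made by this call. */
static int demo_pin_all(struct demo_item *items, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (demo_pin(&items[i]))
			goto err_unwind;
	}
	return 0;

err_unwind:
	/* Walk back over items [0, i); item i itself was never pinned here. */
	while (i-- > 0)
		demo_unpin(&items[i]);
	return -1;
}
```

Because the unwind starts just below the failing index, a pre-existing pin at that index survives, which is the behaviour the unconditional unlink() loop in the quoted patch does not provide.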


* Re: [PATCH bpf-next 3/3] bpftool: support loading flow dissector
From: Jakub Kicinski @ 2018-11-07 23:41 UTC
  To: Stanislav Fomichev
  Cc: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	quentin.monnet, guro, jiong.wang, bhole_prashant_q7,
	john.fastabend, jbenc, treeze.taeung, yhs, osk, sandipan
In-Reply-To: <20181107233448.r7vcnxtdzfnzegas@mini-arch>

On Wed, 7 Nov 2018 15:34:48 -0800, Stanislav Fomichev wrote:
> On 11/07, Jakub Kicinski wrote:
> > On Wed, 7 Nov 2018 15:13:33 -0800, Stanislav Fomichev wrote:  
> > > On 11/07, Jakub Kicinski wrote:  
> > > > On Wed,  7 Nov 2018 14:43:56 -0800, Stanislav Fomichev wrote:    
> > > > > bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
> > > > >         key 0 0 0 0 \
> > > > >         value pinned /sys/fs/bpf/flow/IP/0    
> > > > 
> > > > Where is that /0 coming from ?  Is that in source code?  I don't see
> > > > libbpf adding it, maybe I'm missing something.    
> > > libbpf adds that, that's a program instance:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/bpf/libbpf.c#n1744  
> > 
> > Ugh, I was looking at bpf_object__pin() which uses names :(
> > 
> > We never use this multi-instance thing, and I don't think bpftool ever
> > will, so IMHO it'd be good if we just re-did the pinning loop in
> > bpftool.  
> I wonder whether I should just add special case to bpf_program__pin: don't
> create a subdir when instances.nr == 1 (and just create a file pin for
> single instance)? In that case I can continue to use libbpf and don't reinvent
> the wheel. Any objections?

Mm.. I'm afraid libbpf needs to keep backward compatibility.  We'd have
to add some way for the user (bpftool code) to request the instance ID
does not appear, but (potential) existing users should keep seeing them.
Perhaps others disagree.
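
One way to picture an opt-in that keeps existing behaviour is a path builder with an explicit flag. The sketch below is purely illustrative: demo_pin_path() and its omit_instance_id parameter are hypothetical and are not part of libbpf. With the flag clear, the "<dir>/<instance>" layout is kept, so existing users would still see the instance ID; with the flag set and exactly one instance, the path is just "<dir>".

```c
#include <stdio.h>

/*
 * Hypothetical helper, not part of libbpf: build the pin path for one
 * program instance. Returns 0 on success, -1 if the path does not fit.
 */
static int demo_pin_path(char *buf, size_t len, const char *dir,
			 int instance, int nr_instances, int omit_instance_id)
{
	int n;

	if (omit_instance_id && nr_instances == 1)
		n = snprintf(buf, len, "%s", dir);	/* flat file pin */
	else
		n = snprintf(buf, len, "%s/%d", dir, instance);	/* legacy layout */

	if (n < 0 || (size_t)n >= len)
		return -1;	/* encoding error or truncated path */
	return 0;
}
```

A caller such as bpftool would pass a non-zero flag, while callers that never set it keep getting paths ending in "/0".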


* Re: net: ipv4: tcp_westwood: fixed warnings and checks
From: Suraj Singh @ 2018-11-08  9:16 UTC
  To: davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, suraj1998, Suraj Singh
In-Reply-To: <1541425985-31869-1-git-send-email-suraj1998@gmail.com>

From: Suraj Singh <suraj998@gmail.com>

Regarding why I used "staging: " in the commit message, I was following Greg Kroah-Hartman's video on YouTube on how to submit your first patch, and in his sample commit, he'd started his commit message with "staging: ", and so I thought it was convention to do so. I'll remove that immediately. I made this same mistake in another patch that I just sent for TCP Vegas, I'll make the change there as well.

I didn't consider the complexities of calling the same function twice. I was looking more towards satisfying checkpatch.pl's requirements.

-		tp->snd_cwnd = tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk);
+		tp->snd_cwnd = tcp_westwood_bw_rttmin(sk);
+		tp->snd_ssthresh = tcp_westwood_bw_rttmin(sk);

I've made the same mistake here. I'll make these changes and resubmit. Is there anything else that's wrong? 


* [PATCH] staging: net: ipv4: tcp_westwood: fixed warnings and checks
From: Suraj Singh @ 2018-11-08  9:46 UTC
  To: davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, suraj1998
In-Reply-To: <1541425985-31869-1-git-send-email-suraj1998@gmail.com>

Fixed warnings and checks for TCP Westwood

Signed-off-by: Suraj Singh <suraj1998@gmail.com>
---
 net/ipv4/tcp_westwood.c | 31 ++++++++++++-------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/net/ipv4/tcp_westwood.c b/net/ipv4/tcp_westwood.c
index bec9caf..8879152 100644
--- a/net/ipv4/tcp_westwood.c
+++ b/net/ipv4/tcp_westwood.c
@@ -43,11 +43,10 @@ struct westwood {
 };
 
 /* TCP Westwood functions and constants */
-#define TCP_WESTWOOD_RTT_MIN   (HZ/20)	/* 50ms */
-#define TCP_WESTWOOD_INIT_RTT  (20*HZ)	/* maybe too conservative?! */
+#define TCP_WESTWOOD_RTT_MIN   (HZ / 20)	/* 50ms */
+#define TCP_WESTWOOD_INIT_RTT  (20 * HZ)	/* maybe too conservative?! */
 
-/*
- * @tcp_westwood_create
+/* @tcp_westwood_create
  * This function initializes fields used in TCP Westwood+,
  * it is called after the initial SYN, so the sequence numbers
  * are correct but new passive connections we have no
@@ -73,8 +72,7 @@ static void tcp_westwood_init(struct sock *sk)
 	w->first_ack = 1;
 }
 
-/*
- * @westwood_do_filter
+/* @westwood_do_filter
  * Low-pass filter. Implemented using constant coefficients.
  */
 static inline u32 westwood_do_filter(u32 a, u32 b)
@@ -94,8 +92,7 @@ static void westwood_filter(struct westwood *w, u32 delta)
 	}
 }
 
-/*
- * @westwood_pkts_acked
+/* @westwood_pkts_acked
  * Called after processing group of packets.
  * but all westwood needs is the last sample of srtt.
  */
@@ -108,8 +105,7 @@ static void tcp_westwood_pkts_acked(struct sock *sk,
 		w->rtt = usecs_to_jiffies(sample->rtt_us);
 }
 
-/*
- * @westwood_update_window
+/* @westwood_update_window
  * It updates RTT evaluation window if it is the right moment to do
  * it. If so it calls filter for evaluating bandwidth.
  */
@@ -127,8 +123,7 @@ static void westwood_update_window(struct sock *sk)
 		w->first_ack = 0;
 	}
 
-	/*
-	 * See if a RTT-window has passed.
+	/* See if a RTT-window has passed.
 	 * Be careful since if RTT is less than
 	 * 50ms we don't filter but we continue 'building the sample'.
 	 * This minimum limit was chosen since an estimation on small
@@ -149,12 +144,12 @@ static inline void update_rtt_min(struct westwood *w)
 	if (w->reset_rtt_min) {
 		w->rtt_min = w->rtt;
 		w->reset_rtt_min = 0;
-	} else
+	} else {
 		w->rtt_min = min(w->rtt, w->rtt_min);
+	}
 }
 
-/*
- * @westwood_fast_bw
+/* @westwood_fast_bw
  * It is called when we are in fast path. In particular it is called when
  * header prediction is successful. In such case in fact update is
  * straight forward and doesn't need any particular care.
@@ -171,8 +166,7 @@ static inline void westwood_fast_bw(struct sock *sk)
 	update_rtt_min(w);
 }
 
-/*
- * @westwood_acked_count
+/* @westwood_acked_count
  * This function evaluates cumul_ack for evaluating bk in case of
  * delayed or partial acks.
  */
@@ -207,8 +201,7 @@ static inline u32 westwood_acked_count(struct sock *sk)
 	return w->cumul_ack;
 }
 
-/*
- * TCP Westwood
+/* TCP Westwood
  * Here limit is evaluated as Bw estimation*RTTmin (for obtaining it
  * in packets we use mss_cache). Rttmin is guaranteed to be >= 2
  * so avoids ever returning 0.
-- 
2.7.4


* Re: [PATCH net-next v5 0/9] vrf: allow simultaneous service instances in default and other VRFs
From: David Miller @ 2018-11-08  0:13 UTC
  To: mmanning; +Cc: netdev
In-Reply-To: <20181107153610.7526-1-mmanning@vyatta.att-mail.com>

From: Mike Manning <mmanning@vyatta.att-mail.com>
Date: Wed,  7 Nov 2018 15:36:01 +0000

> Services currently have to be VRF-aware if they are using an unbound
> socket. One cannot have multiple service instances running in the
> default and other VRFs for services that are not VRF-aware and listen
> on an unbound socket. This is because there is no easy way of isolating
> packets received in the default VRF from those arriving in other VRFs.
> 
> This series provides this isolation for stream sockets subject to the
> existing kernel parameter net.ipv4.tcp_l3mdev_accept not being set,
> given that this is documented as allowing a single service instance to
> work across all VRF domains. Similarly, net.ipv4.udp_l3mdev_accept is
> checked for datagram sockets, and net.ipv4.raw_l3mdev_accept is
> introduced for raw sockets. The functionality applies to UDP & TCP
> services as well as those using raw sockets, and is for IPv4 and IPv6.
> 
> Example of running ssh instances in default and blue VRF:
 ...

Series applied, thanks Mike.


* Re: [PATCH] [stable, netdev 4.4+] lan78xx: make sure RX_ADDRL & RX_ADDRH regs are always up to date
From: Sasha Levin @ 2018-11-08  0:17 UTC
  To: Paolo Pisati
  Cc: Woojung Huh, Microchip Linux Driver Support, netdev, stable,
	linux-usb, linux-kernel
In-Reply-To: <1541609457-28725-1-git-send-email-p.pisati@gmail.com>

On Wed, Nov 07, 2018 at 05:50:57PM +0100, Paolo Pisati wrote:
>[partial backport upstream 760db29bdc97b73ff60b091315ad787b1deb5cf5]
>
>Upon invocation, lan78xx_init_mac_address() checks that the mac address present
>in the RX_ADDRL & RX_ADDRH registers is a valid address, if not, it first tries
>to read a new address from an external eeprom or the otp area, and in case both
>reads fail (or the address read back is invalid), it randomly generates a new
>one.
>
>Unfortunately, due to the way the above logic is laid out,
>if both read_eeprom() and read_otp() fail, a new mac address is correctly
>generated but is never written back to RX_ADDRL & RX_ADDRH, leaving the chip in an
>inconsistent state and with an invalid mac address (e.g. the nic appears to be
>completely dead, and doesn't receive any packet, etc):
>
>lan78xx_init_mac_address()
>...
>if (lan78xx_read_eeprom(addr ...) || lan78xx_read_otp(addr ...)) {
>	if (is_valid_ether_addr(addr)) {
>		// nop...
>	} else {
>		random_ether_addr(addr);
>	}
>
>	// correctly writes back the new address
>	lan78xx_write_reg(RX_ADDRL, addr ...);
>	lan78xx_write_reg(RX_ADDRH, addr ...);
>} else {
>	// XXX if both eeprom and otp read fail, we land here and skip
>	// XXX the RX_ADDRL & RX_ADDRH update completely
>	random_ether_addr(addr);
>}
>
>This bug went unnoticed because lan78xx_read_otp() was buggy itself and would
>never fail, up until 4bfc338 "lan78xx: Correctly indicate invalid OTP"
>fixed it and as a side effect uncovered this bug.
>
>4.18+ is fine, since the bug was implicitly fixed in 760db29 "lan78xx: Read MAC
>address from DT if present" when the address change logic was reorganized, but
>it's still present in all stable trees below that: linux-4.4.y, linux-4.9.y,
>linux-4.14.y, etc up to linux-4.18.y (not included).
>
>Signed-off-by: Paolo Pisati <p.pisati@gmail.com>

So why not just take 760db29bdc completely? It looks safer than taking a
partial backport, and will make applying future patches easier.

I tried to do it and it doesn't look like there are any dependencies
that would cause an issue.
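
The control-flow problem described in the quoted commit message can be sketched in plain C. This is an illustration under assumptions, not the lan78xx driver: every demo_* name is hypothetical, the registers are a byte array, only one address source (an EEPROM) is simulated, and the fallback address is fixed rather than random so the result is repeatable. What it shows is a single write-back that is reached both when the stored address is usable and when the fallback is taken.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN 6

/* Hypothetical stand-ins; the real driver talks to USB hardware. */
static int demo_eeprom_ok;			/* 1 if the simulated read works */
static int demo_reg_writes;			/* counts register write-backs */
static uint8_t demo_regs[DEMO_ETH_ALEN];	/* simulated RX_ADDRL/RX_ADDRH */

/* Valid here means: not all-zero and not a multicast address. */
static int demo_addr_valid(const uint8_t *a)
{
	static const uint8_t zero[DEMO_ETH_ALEN] = { 0 };

	return memcmp(a, zero, DEMO_ETH_ALEN) != 0 && !(a[0] & 1);
}

static int demo_read_eeprom(uint8_t *a)
{
	static const uint8_t stored[DEMO_ETH_ALEN] = { 0x02, 1, 2, 3, 4, 5 };

	if (!demo_eeprom_ok)
		return -1;
	memcpy(a, stored, DEMO_ETH_ALEN);
	return 0;
}

/* Fixed locally administered address; a driver would randomize instead. */
static void demo_fallback_addr(uint8_t *a)
{
	static const uint8_t fixed[DEMO_ETH_ALEN] = { 0x02, 0, 0, 0, 0, 1 };

	memcpy(a, fixed, DEMO_ETH_ALEN);
}

/* addr holds what the address registers contained on entry. */
static void demo_init_mac(uint8_t *addr)
{
	if (demo_addr_valid(addr))
		return;		/* registers already hold a usable address */

	if (demo_read_eeprom(addr) || !demo_addr_valid(addr))
		demo_fallback_addr(addr);

	/* Single write-back shared by the EEPROM and fallback paths. */
	memcpy(demo_regs, addr, DEMO_ETH_ALEN);
	demo_reg_writes++;
}
```

Keeping the write-back outside the branch that picks the address source means no path that changes the address can skip it, which is the failure mode the commit message describes.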


* [PATCH V2 bpf-next 1/2] bpf: add perf-event notification support for sock_ops
From: Sowmini Varadhan @ 2018-11-08  0:12 UTC
  To: sowmini.varadhan, daniel, netdev, davem, brakmo, ast
In-Reply-To: <cover.1541630641.git.sowmini.varadhan@oracle.com>

This patch allows eBPF programs that use sock_ops to send
perf-based event notifications using bpf_perf_event_output()

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/core/filter.c |   19 +++++++++++++++++++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e521c5e..23464a3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4048,6 +4048,23 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 	return ret;
 }
 
+BPF_CALL_5(bpf_sock_opts_event_output, struct bpf_sock_ops *, skops,
+	   struct bpf_map *, map, u64, flags, void *, data, u64, size)
+{
+	return bpf_event_output(map, flags, data, size, NULL, 0, NULL);
+}
+
+static const struct bpf_func_proto bpf_sock_ops_event_output_proto =  {
+	.func		= bpf_sock_opts_event_output,
+	.gpl_only       = true,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_ANYTHING,
+	.arg4_type      = ARG_PTR_TO_MEM,
+	.arg5_type      = ARG_CONST_SIZE_OR_ZERO,
+};
+
 static const struct bpf_func_proto bpf_setsockopt_proto = {
 	.func		= bpf_setsockopt,
 	.gpl_only	= false,
@@ -5226,6 +5243,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
+	case BPF_FUNC_perf_event_output:
+		return &bpf_sock_ops_event_output_proto;
 	case BPF_FUNC_setsockopt:
 		return &bpf_setsockopt_proto;
 	case BPF_FUNC_getsockopt:
-- 
1.7.1


* [PATCH V2 bpf-next 2/2] selftests/bpf: add a test case for sock_ops perf-event notification
From: Sowmini Varadhan @ 2018-11-08  0:12 UTC
  To: sowmini.varadhan, daniel, netdev, davem, brakmo, ast
In-Reply-To: <cover.1541630641.git.sowmini.varadhan@oracle.com>

This patch provides a tcp_bpf based eBPF sample. The test
- uses ncat(1) as the TCP client program to connect() to a port
  with the intention of triggering SYN retransmissions: we
  first install an iptables DROP rule to make sure ncat SYNs are
  resent (instead of aborting instantly after a TCP RST)
- has a bpf kernel module that sends a perf-event notification for
  each TCP retransmit, and also tracks the number of such notifications
  sent in the global_map
The test passes when the number of event notifications intercepted
in user-space matches the value in the global_map.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
V2: inline call to sys_perf_event_open() following the style of existing
code in kselftests/bpf

 tools/testing/selftests/bpf/Makefile              |    4 +-
 tools/testing/selftests/bpf/test_tcpnotify.h      |   19 ++
 tools/testing/selftests/bpf/test_tcpnotify_kern.c |   95 +++++++++++
 tools/testing/selftests/bpf/test_tcpnotify_user.c |  186 +++++++++++++++++++++
 4 files changed, 303 insertions(+), 1 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index e39dfb4..6c94048 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -24,12 +24,13 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 	test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
 	test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \
 	test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \
-	test_netcnt
+	test_netcnt test_tcpnotify_user
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
 	test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o     \
 	sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \
 	test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \
+	test_tcpnotify_kern.o \
 	sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \
 	sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \
 	test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \
@@ -74,6 +75,7 @@ $(OUTPUT)/test_sock_addr: cgroup_helpers.c
 $(OUTPUT)/test_socket_cookie: cgroup_helpers.c
 $(OUTPUT)/test_sockmap: cgroup_helpers.c
 $(OUTPUT)/test_tcpbpf_user: cgroup_helpers.c
+$(OUTPUT)/test_tcpnotify_user: cgroup_helpers.c trace_helpers.c
 $(OUTPUT)/test_progs: trace_helpers.c
 $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c
 $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c
diff --git a/tools/testing/selftests/bpf/test_tcpnotify.h b/tools/testing/selftests/bpf/test_tcpnotify.h
new file mode 100644
index 0000000..8b6cea0
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcpnotify.h
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef _TEST_TCPBPF_H
+#define _TEST_TCPBPF_H
+
+struct tcpnotify_globals {
+	__u32 total_retrans;
+	__u32 ncalls;
+};
+
+struct tcp_notifier {
+	__u8    type;
+	__u8    subtype;
+	__u8    source;
+	__u8    hash;
+};
+
+#define	TESTPORT	12877
+#endif
diff --git a/tools/testing/selftests/bpf/test_tcpnotify_kern.c b/tools/testing/selftests/bpf/test_tcpnotify_kern.c
new file mode 100644
index 0000000..edbca20
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcpnotify_kern.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/types.h>
+#include <linux/socket.h>
+#include <linux/tcp.h>
+#include <netinet/in.h>
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+#include "test_tcpnotify.h"
+
+struct bpf_map_def SEC("maps") global_map = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(__u32),
+	.value_size = sizeof(struct tcpnotify_globals),
+	.max_entries = 4,
+};
+
+struct bpf_map_def SEC("maps") perf_event_map = {
+	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(__u32),
+	.max_entries = 2,
+};
+
+int _version SEC("version") = 1;
+
+SEC("sockops")
+int bpf_testcb(struct bpf_sock_ops *skops)
+{
+	int rv = -1;
+	int op;
+
+	op = (int) skops->op;
+
+	if (bpf_ntohl(skops->remote_port) != TESTPORT) {
+		skops->reply = -1;
+		return 0;
+	}
+
+	switch (op) {
+	case BPF_SOCK_OPS_TIMEOUT_INIT:
+	case BPF_SOCK_OPS_RWND_INIT:
+	case BPF_SOCK_OPS_NEEDS_ECN:
+	case BPF_SOCK_OPS_BASE_RTT:
+	case BPF_SOCK_OPS_RTO_CB:
+		rv = 1;
+		break;
+
+	case BPF_SOCK_OPS_TCP_CONNECT_CB:
+	case BPF_SOCK_OPS_TCP_LISTEN_CB:
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		bpf_sock_ops_cb_flags_set(skops, (BPF_SOCK_OPS_RETRANS_CB_FLAG|
+					  BPF_SOCK_OPS_RTO_CB_FLAG));
+		rv = 1;
+		break;
+	case BPF_SOCK_OPS_RETRANS_CB: {
+			__u32 key = 0;
+			struct tcpnotify_globals g, *gp;
+			struct tcp_notifier msg = {
+				.type = 0xde,
+				.subtype = 0xad,
+				.source = 0xbe,
+				.hash = 0xef,
+			};
+
+			rv = 1;
+
+			/* Update results */
+			gp = bpf_map_lookup_elem(&global_map, &key);
+			if (!gp)
+				break;
+			g = *gp;
+			g.total_retrans = skops->total_retrans;
+			g.ncalls++;
+			bpf_map_update_elem(&global_map, &key, &g,
+					    BPF_ANY);
+			bpf_perf_event_output(skops, &perf_event_map,
+					      BPF_F_CURRENT_CPU,
+					      &msg, sizeof(msg));
+		}
+		break;
+	default:
+		rv = -1;
+	}
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tcpnotify_user.c b/tools/testing/selftests/bpf/test_tcpnotify_user.c
new file mode 100644
index 0000000..ff3c452
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tcpnotify_user.c
@@ -0,0 +1,186 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <pthread.h>
+#include <inttypes.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <asm/types.h>
+#include <sys/syscall.h>
+#include <errno.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <sys/socket.h>
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+#include <sys/ioctl.h>
+#include <linux/rtnetlink.h>
+#include <signal.h>
+#include <linux/perf_event.h>
+
+#include "bpf_rlimit.h"
+#include "bpf_util.h"
+#include "cgroup_helpers.h"
+
+#include "test_tcpnotify.h"
+#include "trace_helpers.h"
+
+#define SOCKET_BUFFER_SIZE (getpagesize() < 8192L ? getpagesize() : 8192L)
+
+pthread_t tid;
+int rx_callbacks;
+
+static int dummyfn(void *data, int size)
+{
+	struct tcp_notifier *t = data;
+
+	if (t->type != 0xde || t->subtype != 0xad ||
+	    t->source != 0xbe || t->hash != 0xef)
+		return 1;
+	rx_callbacks++;
+	return 0;
+}
+
+void tcp_notifier_poller(int fd)
+{
+	while (1)
+		perf_event_poller(fd, dummyfn);
+}
+
+static void *poller_thread(void *arg)
+{
+	int fd = *(int *)arg;
+
+	tcp_notifier_poller(fd);
+	return arg;
+}
+
+int verify_result(const struct tcpnotify_globals *result)
+{
+	return (result->ncalls > 0 && result->ncalls == rx_callbacks ? 0 : 1);
+}
+
+static int bpf_find_map(const char *test, struct bpf_object *obj,
+			const char *name)
+{
+	struct bpf_map *map;
+
+	map = bpf_object__find_map_by_name(obj, name);
+	if (!map) {
+		printf("%s:FAIL:map '%s' not found\n", test, name);
+		return -1;
+	}
+	return bpf_map__fd(map);
+}
+
+static int setup_bpf_perf_event(int mapfd)
+{
+	struct perf_event_attr attr = {
+		.sample_type = PERF_SAMPLE_RAW,
+		.type = PERF_TYPE_SOFTWARE,
+		.config = PERF_COUNT_SW_BPF_OUTPUT,
+	};
+	int key = 0;
+	int pmu_fd;
+
+	pmu_fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
+	if (pmu_fd < 0)
+		return pmu_fd;
+	bpf_map_update_elem(mapfd, &key, &pmu_fd, BPF_ANY);
+
+	ioctl(pmu_fd, PERF_EVENT_IOC_ENABLE, 0);
+	return pmu_fd;
+}
+
+int main(int argc, char **argv)
+{
+	const char *file = "test_tcpnotify_kern.o";
+	int prog_fd, map_fd, perf_event_fd;
+	struct tcpnotify_globals g = {0};
+	const char *cg_path = "/foo";
+	int error = EXIT_FAILURE;
+	struct bpf_object *obj;
+	int cg_fd = -1;
+	__u32 key = 0;
+	int rv;
+	char test_script[80];
+	int pmu_fd;
+	cpu_set_t cpuset;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
+
+	if (setup_cgroup_environment())
+		goto err;
+
+	cg_fd = create_and_get_cgroup(cg_path);
+	if (!cg_fd)
+		goto err;
+
+	if (join_cgroup(cg_path))
+		goto err;
+
+	if (bpf_prog_load(file, BPF_PROG_TYPE_SOCK_OPS, &obj, &prog_fd)) {
+		printf("FAILED: load_bpf_file failed for: %s\n", file);
+		goto err;
+	}
+
+	rv = bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_SOCK_OPS, 0);
+	if (rv) {
+		printf("FAILED: bpf_prog_attach: %d (%s)\n",
+		       rv, strerror(errno));
+		goto err;
+	}
+
+	perf_event_fd = bpf_find_map(__func__, obj, "perf_event_map");
+	if (perf_event_fd < 0)
+		goto err;
+
+	map_fd = bpf_find_map(__func__, obj, "global_map");
+	if (map_fd < 0)
+		goto err;
+
+	pmu_fd = setup_bpf_perf_event(perf_event_fd);
+	if (pmu_fd < 0 || perf_event_mmap(pmu_fd) < 0)
+		goto err;
+
+	pthread_create(&tid, NULL, poller_thread, (void *)&pmu_fd);
+
+	sprintf(test_script,
+		"/usr/sbin/iptables -A INPUT -p tcp --dport %d -j DROP",
+		TESTPORT);
+	system(test_script);
+
+	sprintf(test_script,
+		"/usr/bin/nc 127.0.0.1 %d < /etc/passwd > /dev/null 2>&1 ",
+		TESTPORT);
+	system(test_script);
+
+	sprintf(test_script,
+		"/usr/sbin/iptables -D INPUT -p tcp --dport %d -j DROP",
+		TESTPORT);
+	system(test_script);
+
+	rv = bpf_map_lookup_elem(map_fd, &key, &g);
+	if (rv != 0) {
+		printf("FAILED: bpf_map_lookup_elem returns %d\n", rv);
+		goto err;
+	}
+
+	sleep(10);
+
+	if (verify_result(&g)) {
+		printf("FAILED: Wrong stats Expected %d calls, got %d\n",
+			g.ncalls, rx_callbacks);
+		goto err;
+	}
+
+	printf("PASSED!\n");
+	error = 0;
+err:
+	bpf_prog_detach(cg_fd, BPF_CGROUP_SOCK_OPS);
+	close(cg_fd);
+	cleanup_cgroup_environment();
+	return error;
+}
-- 
1.7.1

^ permalink raw reply related

* [PATCH V2 bpf-next 0/2] Perf-based event notification for sock_ops
From: Sowmini Varadhan @ 2018-11-08  0:12 UTC (permalink / raw)
  To: sowmini.varadhan, daniel, netdev, davem, brakmo, ast

This patchset uses the eBPF perf-event based notification mechanism to
solve the problem described in
   https://marc.info/?l=linux-netdev&m=154022219423571&w=2.
Thanks to Daniel Borkmann for feedback/input.

V2: inlined the call to sys_perf_event_open() following the style
    of existing code in kselftests/bpf

The problem statement is
  We would like to monitor some subset of TCP sockets in user-space,
  (the monitoring application would define 4-tuples it wants to monitor)
  using TCP_INFO stats to analyze reported problems. The idea is to
  use those stats to see where the bottlenecks are likely to be ("is it
  application-limited?" or "is there evidence of BufferBloat in the
  path?" etc)

  Today we can do this by periodically polling for tcp_info, but this
  could be made more efficient if the kernel would asynchronously
  notify the application via tcp_info when some "interesting"
  thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
  are reached. And to make this effective, it is better if
  we could apply the threshold check *before* constructing the
  tcp_info netlink notification, so that we don't waste resources
  constructing notifications that will be discarded by the filter.

This patchset solves the problem by adding perf-event based notification
support for sock_ops (Patch 1). The eBPF program can thus
be designed to apply any desired filters to the bpf_sock_ops and
trigger a perf-event notification based on the verdict from the filter.
The user-space component can use these perf-event notifications to either
read any state managed by the eBPF program, or issue a TCP_INFO
netlink call if desired.
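
As a rough illustration of the "apply the threshold check before
constructing the notification" idea described above, here is a minimal
sketch in plain C. The struct names, field names and should_notify() are
hypothetical and are not part of this patchset; a real filter would live
in the sock_ops program and read its inputs from bpf_sock_ops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection statistics; field names are illustrative. */
struct conn_stats {
	uint32_t total_retrans;	/* cumulative retransmitted segments */
	uint32_t rtt_var_us;	/* RTT variance in microseconds */
};

/* Hypothetical thresholds chosen by the monitoring application. */
struct notify_thresholds {
	uint32_t max_total_retrans;
	uint32_t max_rtt_var_us;
};

/*
 * Return true when any statistic exceeds its threshold, i.e. when a
 * notification is worth constructing at all.
 */
static bool should_notify(const struct conn_stats *s,
			  const struct notify_thresholds *t)
{
	assert(s && t);
	return s->total_retrans > t->max_total_retrans ||
	       s->rtt_var_us > t->max_rtt_var_us;
}
```

The point of the design is that the cheap comparison runs first, so the
cost of building a notification is only paid for connections that
actually cross a threshold.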

Patch 2 provides a simple example that shows how to use this infra
(and also provides a test case for it)

Sowmini Varadhan (2):
  bpf: add perf-event notification support for sock_ops
  selftests/bpf: add a test case for sock_ops perf-event notification

 net/core/filter.c                                 |   19 ++
 tools/testing/selftests/bpf/Makefile              |    4 +-
 tools/testing/selftests/bpf/test_tcpnotify.h      |   19 ++
 tools/testing/selftests/bpf/test_tcpnotify_kern.c |   95 +++++++++++
 tools/testing/selftests/bpf/test_tcpnotify_user.c |  186 +++++++++++++++++++++
 5 files changed, 322 insertions(+), 1 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c

^ permalink raw reply

* Re: [PATCH net-next 00/10] udp: implement GRO support
From: David Miller @ 2018-11-08  0:23 UTC (permalink / raw)
  To: pabeni; +Cc: netdev, willemb, steffen.klassert, subashab
In-Reply-To: <cover.1541588248.git.pabeni@redhat.com>

From: Paolo Abeni <pabeni@redhat.com>
Date: Wed,  7 Nov 2018 12:38:27 +0100

> This series implements GRO support for UDP sockets, as the RX counterpart
> of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
> The core functionality is implemented by the second patch, introducing a new
> sockopt to enable UDP_GRO, while patch 3 implements support for passing the
> segment size to the user space via a new cmsg.
> UDP GRO performs a socket lookup for each ingress packet and aggregates datagrams
> directed to UDP GRO enabled sockets with constant l4 tuple.
> 
> UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables NAT
> rules, and that could potentially confuse existing applications.
> 
> The solution adopted here is to de-segment the GRO packet before enqueuing
> as needed. Since we must cope with packet reinsertion after de-segmentation,
> the relevant code is factored-out in ipv4 and ipv6 specific helpers and exposed
> to UDP usage.
> 
> While the current code can probably be improved, this safeguard, implemented in
> patches 4-7, allows future enhancements to enable UDP GSO offload on more
> virtual devices, eventually even on forwarded packets.
> 
> The last 4 patches implement some performance and functional self-tests,
> re-using the existing udpgso infrastructure. The problematic scenario described
> above is explicitly tested.
> 
> This revision of the series tries to address the feedback provided by Willem and
> Subash on the previous iteration.

Series applied.

^ permalink raw reply

* Re: [PATCH bpf-next 3/3] bpftool: support loading flow dissector
From: Stanislav Fomichev @ 2018-11-08  0:40 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	quentin.monnet, guro, jiong.wang, bhole_prashant_q7,
	john.fastabend, jbenc, treeze.taeung, yhs, osk, sandipan
In-Reply-To: <20181107154155.4e7ca3b3@cakuba.netronome.com>

On 11/07, Jakub Kicinski wrote:
> On Wed, 7 Nov 2018 15:34:48 -0800, Stanislav Fomichev wrote:
> > On 11/07, Jakub Kicinski wrote:
> > > On Wed, 7 Nov 2018 15:13:33 -0800, Stanislav Fomichev wrote:  
> > > > On 11/07, Jakub Kicinski wrote:  
> > > > > On Wed,  7 Nov 2018 14:43:56 -0800, Stanislav Fomichev wrote:    
> > > > > > bpftool map update pinned /sys/fs/bpf/flow/jmp_table \
> > > > > >         key 0 0 0 0 \
> > > > > >         value pinned /sys/fs/bpf/flow/IP/0    
> > > > > 
> > > > > Where is that /0 coming from ?  Is that in source code?  I don't see
> > > > > libbpf adding it, maybe I'm missing something.    
> > > > libbpf adds that, that's a program instance:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/bpf/libbpf.c#n1744  
> > > 
> > > Ugh, I was looking at bpf_object__pin() which uses names :(
> > > 
> > > We never use this multi-instance thing, and I don't think bpftool ever
> > > will, so IMHO it'd be good if we just re-did the pinning loop in
> > > bpftool.  
> > I wonder whether I should just add special case to bpf_program__pin: don't
> > create a subdir when instances.nr == 1 (and just create a file pin for
> > single instance)? In that case I can continue to use libbpf and don't reinvent
> > the wheel. Any objections?
> 
> Mm.. I'm afraid libbpf needs to keep backward compatibility.  We'd have
> to add some way for the user (bpftool code) to request the instance ID
> does not appear, but (potential) existing users should keep seeing them.
> Perhaps others disagree.
AFAICT, nobody (seriously) uses bpf_object__pin in the kernel tree and I
have a feeling that the situation is the same outside of the kernel tree.
We can revert/work around if we break somebody, I just don't want to
reimplement the same code in bpftool while there is a possibility that
nobody is using that.

I'll post my proposal as v3, let's see whether other people have
the same objections.

Btw, did we officially commit to the libbpf api/abi somewhere? It always
felt to me like an internal and work-in-progress library.
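
To make the special case proposed above concrete, here is a minimal
sketch of the path selection only. build_pin_path() and
pin_path_selfcheck() are hypothetical helpers written for illustration;
this is not the libbpf implementation and it does not touch the
filesystem. It assumes the current layout is "<path>/<instance>" and
that the proposal drops the subdirectory when there is a single instance.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not libbpf code: build the pin path for instance
 * `idx` of a program that has `nr` instances.
 * Returns 0 on success, -1 on bad input or if the buffer is too small.
 */
static int build_pin_path(char *buf, size_t len, const char *base,
			  size_t nr, size_t idx)
{
	int n;

	if (!buf || !base || nr == 0 || idx >= nr)
		return -1;

	if (nr == 1)
		n = snprintf(buf, len, "%s", base);	/* plain file pin */
	else
		n = snprintf(buf, len, "%s/%zu", base, idx); /* one entry per instance */

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Small self-check covering the single- and multi-instance cases. */
static void pin_path_selfcheck(void)
{
	char buf[64];

	assert(build_pin_path(buf, sizeof(buf), "/sys/fs/bpf/flow/IP", 1, 0) == 0);
	assert(strcmp(buf, "/sys/fs/bpf/flow/IP") == 0);

	assert(build_pin_path(buf, sizeof(buf), "/sys/fs/bpf/flow/IP", 2, 1) == 0);
	assert(strcmp(buf, "/sys/fs/bpf/flow/IP/1") == 0);
}
```

With this shape the bpftool command from the start of the thread would
refer to /sys/fs/bpf/flow/IP rather than /sys/fs/bpf/flow/IP/0, which is
exactly the visible change that the backward-compatibility concern is
about.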

^ permalink raw reply

* Re: [PATCH] staging: net: ipv4: tcp_westwood: fixed warnings and checks
From: Lino Sanfilippo @ 2018-11-08 10:22 UTC (permalink / raw)
  To: Suraj Singh; +Cc: davem, kuznet, yoshfuji, netdev, linux-kernel, suraj1998
In-Reply-To: <1541670377-17483-1-git-send-email-suraj1998@gmail.com>

Hi,

> Sent: Thursday, 08 November 2018 at 10:46
> From: "Suraj Singh" <suraj1998@gmail.com>
> To: davem@davemloft.net
> Cc: kuznet@ms2.inr.ac.ru, yoshfuji@linux-ipv6.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, suraj1998@gmail.com
> Subject: [PATCH] staging: net: ipv4: tcp_westwood: fixed warnings and checks
>
> Fixed warnings and checks for TCP Westwood
> 
> Signed-off-by: Suraj Singh <suraj1998@gmail.com>
> ---


You use the prefix "staging" again in your subject line, which is still wrong. "staging" means that you
fix something in the staging area (see drivers/staging/ in the kernel sources). BTW: the staging area is a much
better starting point for first patches than the core parts of the network subsystem.
Also, if you send a subsequent version of a patch, you have to put the version number (e.g. v2) in the
subject line. Have a look at other patch submissions on this mailing list for examples.

Regards,
Lino

^ permalink raw reply

* [PATCH][net-hns3-next] net: hns3: fix spelling mistake, "assertting" -> "asserting"
From: Colin King @ 2018-11-08 10:32 UTC (permalink / raw)
  To: Yisen Zhuang, Salil Mehta, David S . Miller, Peng Li, netdev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Trivial fix to spelling mistake in dev_err error message

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
index 579945bdcc76..77980e54ad09 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c
@@ -2504,7 +2504,7 @@ static int hclge_reset_prepare_wait(struct hclge_dev *hdev)
 		ret = hclge_func_reset_cmd(hdev, 0);
 		if (ret) {
 			dev_err(&hdev->pdev->dev,
-				"assertting function reset fail %d!\n", ret);
+				"asserting function reset fail %d!\n", ret);
 			return ret;
 		}
 
-- 
2.19.1

^ permalink raw reply related

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-08  0:59 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David Ahern, netdev, Yoel Caspersen
In-Reply-To: <20181105211733.7468cc61@redhat.com>



On 05.11.2018 at 21:17, Jesper Dangaard Brouer wrote:
> On Sun, 4 Nov 2018 01:24:03 +0100 Paweł Staszewski <pstaszewski@itcare.pl> wrote:
>
>> And today again, after applying the patch for the page allocator - reached again
>> 64/64 Gbit/s
>>
>> with only 50-60% cpu load
> Great.
>
>> today no slowpath hit for networking :)
>>
>> But again dropped packets at 64 Gbit RX and 64 Gbit TX ....
>> And as it should not be a PCI Express limit - I think something more is
> Well, this does sound like a PCIe bandwidth limit to me.
>
> See the PCIe BW here: https://en.wikipedia.org/wiki/PCI_Express
>
> You likely have PCIe v3, where one lane has 984.6 MBytes/s or 7.87 Gbit/s.
> Thus, x16 lanes have 15.75 GBytes/s or 126 Gbit/s.  It does say "in each
> direction", but you are also forwarding this RX->TX on both ports of a
> dual-port NIC that is sharing the same PCIe slot.
The network controller was changed from one 2-port 100G ConnectX-4 to two
separate 100G ConnectX-5 cards.
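
For reference, the PCIe v3 figures quoted above can be reproduced with a
short calculation. This is only a sketch: it assumes the standard PCIe
3.0 parameters (8 GT/s per lane, 128b/130b encoding) and ignores TLP/DLLP
protocol overhead, so real usable throughput is lower. The function names
are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* PCIe 3.0: 8 GT/s per lane with 128b/130b line encoding (per direction). */
#define PCIE3_GT_PER_SEC	8e9
#define PCIE3_ENCODING		(128.0 / 130.0)

/*
 * Raw data rate in bits per second for `lanes` lanes, one direction,
 * before protocol (TLP/DLLP) overhead.
 */
static double pcie3_bits_per_sec(unsigned int lanes)
{
	assert(lanes > 0);
	return PCIE3_GT_PER_SEC * PCIE3_ENCODING * lanes;
}

/* Print the rate in both Gbit/s and GBytes/s for a given link width. */
static void pcie3_print(unsigned int lanes)
{
	double bps = pcie3_bits_per_sec(lanes);

	printf("x%u: %.2f Gbit/s, %.2f GBytes/s\n",
	       lanes, bps / 1e9, bps / 8e9);
}
```

Calling pcie3_print(1) and pcie3_print(16) gives the per-lane and x16
numbers; with one dual-port card in a single x16 slot, the forwarded
traffic of both ports has to fit within that per-direction budget.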


    PerfTop:   92239 irqs/sec  kernel:99.4%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
------------------------------------------------------------------------------------------

      6.65%  [kernel]       [k] irq_entries_start
      5.57%  [kernel]       [k] tasklet_action_common.isra.21
      4.60%  [kernel]       [k] mlx5_eq_int
      4.04%  [kernel]       [k] mlx5e_skb_from_cqe_mpwrq_linear
      3.66%  [kernel]       [k] _raw_spin_lock_irqsave
      3.58%  [kernel]       [k] mlx5e_sq_xmit
      2.66%  [kernel]       [k] fib_table_lookup
      2.52%  [kernel]       [k] _raw_spin_lock
      2.51%  [kernel]       [k] build_skb
      2.50%  [kernel]       [k] _raw_spin_lock_irq
      2.04%  [kernel]       [k] try_to_wake_up
      1.83%  [kernel]       [k] queued_spin_lock_slowpath
      1.81%  [kernel]       [k] mlx5e_poll_tx_cq
      1.65%  [kernel]       [k] do_idle
      1.50%  [kernel]       [k] mlx5e_poll_rx_cq
      1.34%  [kernel]       [k] __sched_text_start
      1.32%  [kernel]       [k] cmd_exec
      1.30%  [kernel]       [k] cmd_work_handler
      1.16%  [kernel]       [k] vlan_do_receive
      1.15%  [kernel]       [k] memcpy_erms
      1.15%  [kernel]       [k] __dev_queue_xmit
      1.07%  [kernel]       [k] mlx5_cmd_comp_handler
      1.06%  [kernel]       [k] sched_ttwu_pending
      1.00%  [kernel]       [k] ipt_do_table
      0.98%  [kernel]       [k] ip_finish_output2
      0.92%  [kernel]       [k] pfifo_fast_dequeue
      0.88%  [kernel]       [k] mlx5e_handle_rx_cqe_mpwrq
      0.78%  [kernel]       [k] dev_gro_receive
      0.78%  [kernel]       [k] mlx5e_napi_poll
      0.76%  [kernel]       [k] mlx5e_post_rx_mpwqes
      0.70%  [kernel]       [k] process_one_work
      0.67%  [kernel]       [k] __netif_receive_skb_core
      0.65%  [kernel]       [k] __build_skb
      0.63%  [kernel]       [k] llist_add_batch
      0.62%  [kernel]       [k] tcp_gro_receive
      0.60%  [kernel]       [k] inet_gro_receive
      0.59%  [kernel]       [k] ip_route_input_rcu
      0.59%  [kernel]       [k] rcu_irq_exit
      0.56%  [kernel]       [k] napi_complete_done
      0.52%  [kernel]       [k] kmem_cache_alloc
      0.48%  [kernel]       [k] __softirqentry_text_start
      0.48%  [kernel]       [k] mlx5e_xmit
      0.47%  [kernel]       [k] __queue_work
      0.46%  [kernel]       [k] memset_erms
      0.46%  [kernel]       [k] dev_hard_start_xmit
      0.45%  [kernel]       [k] insert_work
      0.45%  [kernel]       [k] enqueue_task_fair
      0.44%  [kernel]       [k] __wake_up_common
      0.43%  [kernel]       [k] finish_task_switch
      0.43%  [kernel]       [k] kmem_cache_free_bulk
      0.42%  [kernel]       [k] ip_forward
      0.42%  [kernel]       [k] worker_thread
      0.41%  [kernel]       [k] schedule
      0.41%  [kernel]       [k] _raw_spin_unlock_irqrestore
      0.40%  [kernel]       [k] netif_skb_features
      0.40%  [kernel]       [k] queue_work_on
      0.40%  [kernel]       [k] pfifo_fast_enqueue
      0.39%  [kernel]       [k] vlan_dev_hard_start_xmit
      0.39%  [kernel]       [k] page_frag_free
      0.36%  [kernel]       [k] swiotlb_map_page
      0.36%  [kernel]       [k] update_cfs_rq_h_load
      0.35%  [kernel]       [k] validate_xmit_skb.isra.142
      0.35%  [kernel]       [k] dev_ifconf
      0.35%  [kernel]       [k] check_preempt_curr
      0.34%  [kernel]       [k] _raw_spin_trylock
      0.34%  [kernel]       [k] rcu_idle_exit
      0.33%  [kernel]       [k] ip_rcv_core.isra.20.constprop.25
      0.33%  [kernel]       [k] __qdisc_run
      0.33%  [kernel]       [k] skb_release_data
      0.32%  [kernel]       [k] native_sched_clock
      0.30%  [kernel]       [k] add_interrupt_randomness
      0.29%  [kernel]       [k] interrupt_entry
      0.28%  [kernel]       [k] skb_gro_receive
      0.26%  [kernel]       [k] read_tsc
      0.26%  [kernel]       [k] __get_xps_queue_idx
      0.26%  [kernel]       [k] inet_gifconf
      0.26%  [kernel]       [k] skb_segment
      0.25%  [kernel]       [k] __tasklet_schedule_common
      0.25%  [kernel]       [k] smpboot_thread_fn
      0.23%  [kernel]       [k] __update_load_avg_se
      0.22%  [kernel]       [k] tcp4_gro_receive


Not much traffic now:
   bwm-ng v0.6.1 (probing every 0.500s), press 'h' for help
   input: /proc/net/dev type: rate
   |         iface                   Rx                   Tx                Total
==============================================================================
          enp175s0:           6.95 Gb/s            4.20 Gb/s           11.15 Gb/s
          enp216s0:           4.23 Gb/s            6.98 Gb/s           11.21 Gb/s
------------------------------------------------------------------------------
             total:          11.18 Gb/s           11.18 Gb/s           22.37 Gb/s

   bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
   input: /proc/net/dev type: rate
   |         iface                   Rx                   Tx                Total
==============================================================================
          enp175s0:       700264.50 P/s        923890.25 P/s       1624154.75 P/s
          enp216s0:       932598.81 P/s        708771.50 P/s       1641370.25 P/s
------------------------------------------------------------------------------
             total:      1632863.38 P/s       1632661.75 P/s       3265525.00 P/s






>
>
>> going on there - and it is hard to catch - because perf top hasn't changed,
>> besides there being no queued slowpath hit now
>>
>> I have now also ordered Intel cards to compare - but with a 3 week ETA
>> Sooner - in 3 days - I will have Mellanox ConnectX-5 cards - so I can
>> separate traffic onto two different x16 PCIe buses
> I do think you need to separate traffic to two different x16 PCIe
> slots.  I have found that the ConnectX-5 has significantly faster
> packet-per-sec performance than ConnectX-4, but that is not your
> use-case (max BW). I've not tested these NICs for maximum
> _bidirectional_ bandwidth limits, I've only made sure I can do 100G
> unidirectional, which can hit some funny motherboard memory limits
> (remember to equip motherboard with 4 RAM blocks for full memory BW).
>
Yes memory channels are separated and there are 4 modules per cpu :)

^ permalink raw reply

* [PATCH bpf] tools/bpftool: copy uapi/linux/tc_act/tc_bpf.h to tools directory
From: Yonghong Song @ 2018-11-08  1:00 UTC (permalink / raw)
  To: ast, kafai, daniel, netdev, rong.a.chen, zhijianx.li; +Cc: kernel-team

Commit f6f3bac08ff9 ("tools/bpf: bpftool: add net support")
added certain networking support to bpftool.
The implementation relies on a relatively recent uapi header file
linux/tc_act/tc_bpf.h on the host which contains the
definition of TCA_ACT_BPF_ID.

Unfortunately, this is not the case for all distributions.
See the email message below where rhel-7.2 does not have
an up-to-date linux/tc_act/tc_bpf.h.
  https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1799211.html

This patch fixes the issue by copying linux/tc_act/tc_bpf.h from the
kernel include/uapi directory to the tools/include/uapi directory so
building bpftool does not depend on the host system for this file.

Fixes: f6f3bac08ff9 ("tools/bpf: bpftool: add net support")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Yonghong Song <yhs@fb.com>
---
 tools/include/uapi/linux/tc_act/tc_bpf.h | 37 ++++++++++++++++++++++++
 1 file changed, 37 insertions(+)
 create mode 100644 tools/include/uapi/linux/tc_act/tc_bpf.h

diff --git a/tools/include/uapi/linux/tc_act/tc_bpf.h b/tools/include/uapi/linux/tc_act/tc_bpf.h
new file mode 100644
index 000000000000..6e89a5df49a4
--- /dev/null
+++ b/tools/include/uapi/linux/tc_act/tc_bpf.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+/*
+ * Copyright (c) 2015 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __LINUX_TC_BPF_H
+#define __LINUX_TC_BPF_H
+
+#include <linux/pkt_cls.h>
+
+#define TCA_ACT_BPF 13
+
+struct tc_act_bpf {
+	tc_gen;
+};
+
+enum {
+	TCA_ACT_BPF_UNSPEC,
+	TCA_ACT_BPF_TM,
+	TCA_ACT_BPF_PARMS,
+	TCA_ACT_BPF_OPS_LEN,
+	TCA_ACT_BPF_OPS,
+	TCA_ACT_BPF_FD,
+	TCA_ACT_BPF_NAME,
+	TCA_ACT_BPF_PAD,
+	TCA_ACT_BPF_TAG,
+	TCA_ACT_BPF_ID,
+	__TCA_ACT_BPF_MAX,
+};
+#define TCA_ACT_BPF_MAX (__TCA_ACT_BPF_MAX - 1)
+
+#endif
-- 
2.17.1

^ permalink raw reply related

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-08  1:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: David Ahern, netdev, Yoel Caspersen
In-Reply-To: <795357b6-04b8-dbc2-acfe-d561f10d4a2a@itcare.pl>



On 08.11.2018 at 01:59, Paweł Staszewski wrote:
>
>
> On 05.11.2018 at 21:17, Jesper Dangaard Brouer wrote:
>> On Sun, 4 Nov 2018 01:24:03 +0100 Paweł Staszewski <pstaszewski@itcare.pl> wrote:
>>
>>> And today again, after applying the patch for the page allocator - reached again
>>> 64/64 Gbit/s
>>>
>>> with only 50-60% cpu load
>> Great.
>>
>>> today no slowpath hit for networking :)
>>>
>>> But again dropped packets at 64 Gbit RX and 64 Gbit TX ....
>>> And as it should not be a PCI Express limit - I think something more is
>> Well, this does sound like a PCIe bandwidth limit to me.
>>
>> See the PCIe BW here: https://en.wikipedia.org/wiki/PCI_Express
>>
>> You likely have PCIe v3, where one lane has 984.6 MBytes/s or 7.87 Gbit/s.
>> Thus, x16 lanes have 15.75 GBytes/s or 126 Gbit/s.  It does say "in each
>> direction", but you are also forwarding this RX->TX on both ports of a
>> dual-port NIC that is sharing the same PCIe slot.
> The network controller was changed from one 2-port 100G ConnectX-4 to two
> separate 100G ConnectX-5 cards.
>
>
Also, is it normal that some kworker processes take 10%+ of CPU?
Top output below:

  2913 root      20   0       0      0      0 I  10.3  0.0   6:58.29 kworker/u112:1-
     7 root      20   0       0      0      0 I   8.6  0.0   6:17.18 kworker/u112:0-
10289 root      20   0       0      0      0 I   6.6  0.0   6:33.90 kworker/u112:4-
  2939 root      20   0       0      0      0 R   3.6  0.0   7:37.68 kworker/u112:2-
  4557 root      20   0       0      0      0 I   1.3  0.0   0:08.82 kworker/45:4-ev
  6775 root      20   0       0      0      0 I   1.3  0.0   0:26.30 kworker/50:4-ev
  6833 root      20   0       0      0      0 D   1.3  0.0   0:04.96 kworker/15:0+ev
  6840 root      20   0       0      0      0 I   1.3  0.0   0:09.32 kworker/55:2-ev
  6874 root      20   0       0      0      0 D   1.3  0.0   0:08.51 kworker/53:0+ev
  7710 root      20   0       0      0      0 I   1.3  0.0   0:07.78 kworker/14:1-ev
12075 root      20   0       0      0      0 I   1.3  0.0   1:19.22 kworker/23:3-ev
31209 root      20   0       0      0      0 I   1.3  0.0   0:07.02 kworker/20:1-ev
32351 root      20   0       0      0      0 R   1.3  0.0   0:06.99 kworker/51:2+ev
39869 root      20   0       0      0      0 D   1.3  0.0   0:06.15 kworker/42:0+ev
39959 root      20   0       0      0      0 I   1.3  0.0   0:16.23 kworker/51:1-ev
42858 root      20   0       0      0      0 I   1.3  0.0   0:47.72 kworker/27:2-ev
43281 root      20   0       0      0      0 I   1.3  0.0   0:14.99 kworker/14:4-ev
43282 root      20   0       0      0      0 I   1.3  0.0   0:13.38 kworker/16:1-ev
43389 root      20   0       0      0      0 D   1.3  0.0   0:08.92 kworker/54:2+ev
45214 root      20   0       0      0      0 I   1.3  0.0   0:05.82 kworker/55:0-ev
46894 root      20   0       0      0      0 I   1.3  0.0   0:04.11 kworker/46:1-ev
47027 root      20   0       0      0      0 D   1.3  0.0   0:03.79 kworker/47:1+ev
47129 root      20   0       0      0      0 D   1.3  0.0   0:03.15 kworker/52:0+ev
47133 root      20   0       0      0      0 I   1.3  0.0   0:03.19 kworker/49:1-ev
47179 root      20   0       0      0      0 I   1.3  0.0   0:02.83 
kworker/17:3-ev
48062 root      20   0       0      0      0 I   1.3  0.0   0:02.54 
kworker/44:1-ev
48158 root      20   0       0      0      0 I   1.3  0.0   0:02.17 
kworker/16:2-ev
48168 root      20   0       0      0      0 I   1.3  0.0   0:02.13 
kworker/27:3-ev
48247 root      20   0       0      0      0 I   1.3  0.0   0:01.83 
kworker/22:0-ev
48337 root      20   0       0      0      0 I   1.3  0.0   0:01.57 
kworker/15:1-ev
48345 root      20   0       0      0      0 I   1.3  0.0   0:01.49 
kworker/24:3-ev
49302 root      20   0       0      0      0 I   1.3  0.0   0:00.71 
kworker/54:1-ev
49366 root      20   0       0      0      0 I   1.3  0.0   0:00.38 
kworker/20:3-ev
49400 root      20   0       0      0      0 I   1.3  0.0   0:00.31 
kworker/26:2-ev
49430 root      20   0       0      0      0 I   1.3  0.0   0:00.21 
kworker/42:2-ev
49463 root      20   0       0      0      0 D   1.3  0.0   0:00.08 
kworker/50:2+ev
51698 root      20   0       0      0      0 D   1.3  0.0   0:14.85 
kworker/46:2+ev
54238 root      20   0       0      0      0 I   1.3  0.0   0:23.73 
kworker/52:1-ev
  2507 root      20   0       0      0      0 I   1.0  0.0   0:09.60 
kworker/44:2-ev
  4525 root      20   0       0      0      0 I   1.0  0.0   0:08.07 
kworker/26:1-ev
  4556 root      20   0       0      0      0 I   1.0  0.0   0:05.15 
kworker/48:0-ev
  4604 root      20   0       0      0      0 I   1.0  0.0   0:10.90 
kworker/19:0-ev
  5789 root      20   0       0      0      0 I   1.0  0.0   0:08.24 
kworker/18:0-ev
  6868 root      20   0       0      0      0 I   1.0  0.0   0:09.68 
kworker/47:0-ev
  6900 root      20   0       0      0      0 I   1.0  0.0   0:28.83 
kworker/18:1-ev
  7764 root      20   0       0      0      0 I   1.0  0.0   0:03.00 
kworker/49:2-ev
12045 root      20   0       0      0      0 I   1.0  0.0   1:16.98 
kworker/24:2-ev
32218 root      20   0       0      0      0 I   1.0  0.0   0:04.13 
kworker/45:2-ev
34082 root      20   0       0      0      0 I   1.0  0.0   0:06.29 
kworker/17:1-ev
39791 root      20   0       0      0      0 I   1.0  0.0   0:19.51 
kworker/21:4-ev
39973 root      20   0       0      0      0 I   1.0  0.0   0:17.12 
kworker/53:2-ev
43223 root      20   0       0      0      0 I   1.0  0.0   0:07.88 
kworker/25:0-ev
43295 root      20   0       0      0      0 I   1.0  0.0   0:10.89 
kworker/22:4-ev
46055 root      20   0       0      0      0 I   1.0  0.0   0:04.00 
kworker/21:2-ev
46077 root      20   0       0      0      0 I   1.0  0.0   0:04.62 
kworker/19:1-ev
47204 root      20   0       0      0      0 I   1.0  0.0   0:03.03 
kworker/25:2-ev
47989 root      20   0       0      0      0 I   1.0  0.0   0:02.65 
kworker/43:1-ev
49127 root      20   0       0      0      0 I   1.0  0.0   0:01.10 
kworker/48:2-ev
49317 root      20   0       0      0      0 I   1.0  0.0   0:00.56 
kworker/23:1-ev
54191 root      20   0       0      0      0 R   1.0  0.0   0:30.27 
kworker/43:2+ev
    81 root      20   0       0      0      0 S   0.7  0.0   0:50.27 
ksoftirqd/14
    87 root      20   0       0      0      0 S   0.7  0.0   1:02.92 
ksoftirqd/15
   102 root      20   0       0      0      0 S   0.7  0.0   0:29.78 
ksoftirqd/18
   117 root      20   0       0      0      0 S   0.7  0.0   0:30.73 
ksoftirqd/21
   127 root      20   0       0      0      0 S   0.7  0.0   0:24.45 
ksoftirqd/23
   137 root      20   0       0      0      0 S   0.7  0.0   0:24.94 
ksoftirqd/25
   142 root      20   0       0      0      0 S   0.7  0.0   0:21.74 
ksoftirqd/26
   222 root      20   0       0      0      0 S   0.7  0.0   0:27.83 
ksoftirqd/42
   227 root      20   0       0      0      0 S   0.7  0.0   0:25.35 
ksoftirqd/43
   242 root      20   0       0      0      0 S   0.7  0.0   0:21.40 
ksoftirqd/46
   267 root      20   0       0      0      0 S   0.7  0.0   0:08.62 
ksoftirqd/51
  5174 root      20   0       0      0      0 I   0.7  0.0   5:57.10 
kworker/u112:3-

>
>
>>
>>
>>> going on there - and hard to catch - cause perf top doestn chenged
>>> besides there is no queued slowpath hit now
>>>
>>> I ordered now also intel cards to compare - but 3 weeks eta
>>> Faster - cause 3 days - i will have mellanox connectx 5 - so can
>>> separate traffic to two different x16 pcie busses
>> I do think you need to separate traffic to two different x16 PCIe
>> slots.  I have found that the ConnectX-5 is significantly faster
>> packet-per-sec performance than ConnectX-4, but that is not your
>> use-case (max BW). I've not tested these NICs for maximum
>> _bidirectional_ bandwidth limits, I've only made sure I can do 100G
>> unidirectional, which can hit some funny motherboard memory limits
>> (remember to equip motherboard with 4 RAM blocks for full memory BW).
>>
> Yes memory channels are separated and there are 4 modules per cpu :)
>
>

^ permalink raw reply

* Re: [net 00/14][pull request] Intel Wired LAN Driver Updates 2018-11-07
From: David Miller @ 2018-11-08  1:14 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: netdev, nhorman, sassmann
In-Reply-To: <20181107191631.5072-1-jeffrey.t.kirsher@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed,  7 Nov 2018 11:16:17 -0800

> This series contains fixes to igb, i40e and ice drivers.

Pulled, thanks Jeff.

^ permalink raw reply

* Re: [PATCH bpf] tools/bpftool: copy uapi/linux/tc_act/tc_bpf.h to tools directory
From: Li Zhijian @ 2018-11-08  1:15 UTC (permalink / raw)
  To: Yonghong Song, ast, kafai, daniel, netdev, rong.a.chen; +Cc: kernel-team
In-Reply-To: <20181108010011.3982963-1-yhs@fb.com>

On 11/8/2018 9:00 AM, Yonghong Song wrote:
> Commit f6f3bac08ff9 ("tools/bpf: bpftool: add net support")
> added certain networking support to bpftool.
> The implementation relies on a relatively recent uapi header file
> linux/tc_act/tc_bpf.h on the host which contains the marco
> definition of TCA_ACT_BPF_ID.
>
> Unfortunately, this is not the case for all distributions.
> See the email message below where rhel-7.2 does not have
> an up-to-date linux/tc_act/tc_bpf.h.
>    https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1799211.html

I have not tested this patch, but based on the earlier commit
f6f3bac08ff9 ("tools/bpf: bpftool: add net support")
I cooked up a similar patch locally, and I noticed that it also requires an
up-to-date linux/pkt_cls.h to avoid compile errors:

root@lkp-bdw-ep3 ~/linux-f6f3bac08f/tools/bpf/bpftool# make V=1
[...snip...]
gcc -O2 -W -Wall -Wextra -Wno-unused-parameter -Wshadow -Wno-missing-field-initializers -DPACKAGE='"bpftool"' -D__EXPORTED_HEADERS__ -I/root/linux-f6f3bac08f/kernel/bpf/ -I/root/linux-f6f3bac08f/tools/include -I/root/linux-f6f3bac08f/tools/include/uapi -I/root/linux-f6f3bac08f/tools/lib/bpf -I/root/linux-f6f3bac08f/tools/perf -DBPFTOOL_VERSION='"4.19.0-rc2"' -DCOMPAT_NEED_REALLOCARRAY   -c -MMD -o netlink_dumper.o netlink_dumper.c
make -C /root/linux-f6f3bac08f/tools/lib/bpf/ OUTPUT= libbpf.a
make[1]: Entering directory '/root/linux-f6f3bac08f/tools/lib/bpf'
netlink_dumper.c: In function 'do_bpf_filter_dump':
netlink_dumper.c:153:9: error: 'TCA_BPF_ID' undeclared (first use in this function)
   if (tb[TCA_BPF_ID])
          ^~~~~~~~~~
netlink_dumper.c:153:9: note: each undeclared identifier is reported only once for each function it appears in
netlink_dumper.c:155:9: error: 'TCA_BPF_TAG' undeclared (first use in this function)
   if (tb[TCA_BPF_TAG])
          ^~~~~~~~~~~
Makefile:96: recipe for target 'netlink_dumper.o' failed
make: *** [netlink_dumper.o] Error 1
make: *** Waiting for unfinished jobs....
make -f /root/linux-f6f3bac08f/tools/build/Makefile.build dir=. obj=libbpf
make[1]: Leaving directory '/root/linux-f6f3bac08f/tools/lib/bpf'

Thanks
Zhijian

>
> This patch fixed the issue by copying linux/tc_act/tc_bpf.h from
> kernel include/uapi directory to tools/include/uapi directory so
> building the bpftool does not depend on host system for this file.
>
> Fixes: f6f3bac08ff9 ("tools/bpf: bpftool: add net support")
> Reported-by: kernel test robot <rong.a.chen@intel.com>
> Signed-off-by: Yonghong Song <yhs@fb.com>
> ---
>   tools/include/uapi/linux/tc_act/tc_bpf.h | 37 ++++++++++++++++++++++++
>   1 file changed, 37 insertions(+)
>   create mode 100644 tools/include/uapi/linux/tc_act/tc_bpf.h
>
> diff --git a/tools/include/uapi/linux/tc_act/tc_bpf.h b/tools/include/uapi/linux/tc_act/tc_bpf.h
> new file mode 100644
> index 000000000000..6e89a5df49a4
> --- /dev/null
> +++ b/tools/include/uapi/linux/tc_act/tc_bpf.h
> @@ -0,0 +1,37 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +/*
> + * Copyright (c) 2015 Jiri Pirko <jiri@resnulli.us>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#ifndef __LINUX_TC_BPF_H
> +#define __LINUX_TC_BPF_H
> +
> +#include <linux/pkt_cls.h>
> +
> +#define TCA_ACT_BPF 13
> +
> +struct tc_act_bpf {
> +	tc_gen;
> +};
> +
> +enum {
> +	TCA_ACT_BPF_UNSPEC,
> +	TCA_ACT_BPF_TM,
> +	TCA_ACT_BPF_PARMS,
> +	TCA_ACT_BPF_OPS_LEN,
> +	TCA_ACT_BPF_OPS,
> +	TCA_ACT_BPF_FD,
> +	TCA_ACT_BPF_NAME,
> +	TCA_ACT_BPF_PAD,
> +	TCA_ACT_BPF_TAG,
> +	TCA_ACT_BPF_ID,
> +	__TCA_ACT_BPF_MAX,
> +};
> +#define TCA_ACT_BPF_MAX (__TCA_ACT_BPF_MAX - 1)
> +
> +#endif

^ permalink raw reply

* Re: [PATCH] [stable, netdev 4.4+] lan78xx: make sure RX_ADDRL & RX_ADDRH regs are always up to date
From: Paolo Pisati @ 2018-11-08 11:01 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Paolo Pisati, Woojung Huh, Microchip Linux Driver Support, netdev,
	stable, linux-usb, linux-kernel
In-Reply-To: <20181108001751.GA8097@sasha-vm>

On Wed, Nov 07, 2018 at 07:17:51PM -0500, Sasha Levin wrote:
> So why not just take 760db29bdc completely? It looks safer than taking a
> partial backport, and will make applying future patches easier.
> 
> I tried to do it and it doesn't look like there are any dependencies
> that would cause an issue.

Somehow I was convinced it didn't build on 4.4.x... can you pick it up?

commit 760db29bdc97b73ff60b091315ad787b1deb5cf5
Author: Phil Elwell <phil@raspberrypi.org>
Date:   Thu Apr 19 17:59:38 2018 +0100

    lan78xx: Read MAC address from DT if present
    
    There is a standard mechanism for locating and using a MAC address from
    the Device Tree. Use this facility in the lan78xx driver to support
    applications without programmed EEPROM or OTP. At the same time,
    regularise the handling of the different address sources.
    
    Signed-off-by: Phil Elwell <phil@raspberrypi.org>
    Signed-off-by: David S. Miller <davem@davemloft.net>
-- 
bye,
p.

^ permalink raw reply

* [Kernel][NET] Bug report on packet defragmenting
From: 배석진 @ 2018-11-08  1:29 UTC (permalink / raw)
  To: netdev@vger.kernel.org
In-Reply-To: <CGME20181108012927epcms1p47f719c1908da64a378690362901644ee@epcms1p4>

Hello,
This is Bae from Samsung Electronics.

We hit a problem where fragmented SIP packets could not be delivered to user space.
We found that they were stolen in the ipv6_defrag hook function.

This happens with SMP and RPS enabled.
Fragments after the first one carry no further network header beyond the IP header.
But __skb_flow_dissect still reads the port field to fill the 'ports' hash key.
So each fragment gets a different hash and may be sent to a different core.
Even when the hashes differ, the selected CPU can be the same, but that is just luck. [exam 2]

In addition, when those packets arrive with a small time gap,
the ipv6_defrag hook runs simultaneously on each core.
Each packet is then treated as the first fragment.
They cannot be merged into the original packet and are never delivered to the upper layer. [exam 1]

If the ipv6_defrag hook is not executed simultaneously, everything is fine.
The ipv6_defrag hook can handle that case. [exam 3]

The patch below skips setting 'ports' when the packet is fragmented.
Because IPv6 SIP INVITE packets are usually fragmented, this problem occurs very often.
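
The hash divergence described above can be modelled in user space. The following is only a minimal sketch, not the kernel code: toy_flow_keys, toy_dissect, toy_hash and toy_frag_offset are hypothetical names, and FNV-1a merely stands in for the kernel's flow hash. The byte values are copied from the two fragments shown in [exam 1] below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for the kernel's flow keys. */
struct toy_flow_keys {
	uint8_t src[16];
	uint8_t dst[16];
	uint8_t proto;
	uint8_t ports[4];
};

/* Addresses shared by both fragments in [exam 1]. */
static const uint8_t toy_src[16] = {
	0x20, 0x01, 0x44, 0x30, 0x00, 0x05, 0x04, 0x01,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x53,
};
static const uint8_t toy_dst[16] = {
	0x20, 0x01, 0x44, 0x30, 0x01, 0x40, 0xa6, 0x35,
	0x00, 0x00, 0x00, 0x00, 0x03, 0xcd, 0x09, 0xf3,
};

/* Fragment header (8 bytes) followed by the first 4 payload bytes. */
static const uint8_t toy_first_frag[12] = {
	0x32, 0x00, 0x00, 0x01, 0x50, 0x2a, 0x54, 0x16,
	0x00, 0x00, 0xc4, 0xc3,
};
static const uint8_t toy_later_frag[12] = {
	0x32, 0x00, 0x05, 0x88, 0x50, 0x2a, 0x54, 0x16,
	0x5d, 0x6a, 0xa0, 0xda,
};

/* FNV-1a, used only as a placeholder for the real flow hash. */
static uint32_t toy_hash(const struct toy_flow_keys *keys)
{
	const uint8_t *p = (const uint8_t *)keys;
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < sizeof(*keys); i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/*
 * Fill the keys from one fragment. Every packet passed in here carries a
 * fragment header, so with skip_ports_on_frag set the ports stay zero.
 */
static void toy_dissect(const uint8_t *frag, bool skip_ports_on_frag,
			struct toy_flow_keys *keys)
{
	assert(frag && keys);
	memset(keys, 0, sizeof(*keys));
	memcpy(keys->src, toy_src, sizeof(keys->src));
	memcpy(keys->dst, toy_dst, sizeof(keys->dst));
	keys->proto = frag[0];	/* next header after the fragment header */
	if (!skip_ports_on_frag)
		memcpy(keys->ports, frag + 8, sizeof(keys->ports));
}

/* Offset of this fragment in bytes (upper 13 bits, in 8-byte units). */
static unsigned int toy_frag_offset(const uint8_t *frag)
{
	unsigned int field = ((unsigned int)frag[2] << 8) | frag[3];

	return field & 0xfff8u;
}
```

With skip_ports_on_frag cleared, the two fragments end up with different 'ports' bytes, so their hashes will in general differ. With it set, both fragments produce identical keys and therefore the same hash, which is the behaviour the patch aims for.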



From be74b56861cf76a16d0f2d054d468c584ed67cce Mon Sep 17 00:00:00 2001
From: soukjin bae <soukjin.bae@samsung.com>
Date: Thu, 8 Nov 2018 09:52:29 +0900
Subject: [PATCH] flow_dissector: don't refer port field in fragmented packet

Fragments after the first one carry no further network header beyond the IP header.
So when the dissector tries to read the port field at nexthdr, it gets garbage from the payload.
Skip setting the ports when the packet is fragmented.

Signed-off-by: soukjin bae <soukjin.bae@samsung.com>
---
 net/core/flow_dissector.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 676f3ad629f9..928df25129ba 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1166,8 +1166,8 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
 		break;
 	}
 
-	if (dissector_uses_key(flow_dissector,
-			       FLOW_DISSECTOR_KEY_PORTS)) {
+	if (dissector_uses_key(flow_dissector, FLOW_DISSECTOR_KEY_PORTS)
+	    && !(key_control->flags & FLOW_DIS_IS_FRAGMENT)) {
 		key_ports = skb_flow_dissector_target(flow_dissector,
 						      FLOW_DISSECTOR_KEY_PORTS,
 						      target_container);
-- 
2.13.0



[exam 1]
# tcp dump #
<TIME>          <Source>            <S.Port>    <Destination>               <D.Port>    <Proto> <Len>   <Info>
13:40:11.938345 2001:4430:5:401::53             2001:4430:140:a635::3cd:9f3             IPv6    396	    IPv6 fragment
0010   6b 80 00 00 01 54 2c 3d 20 01 44 30 00 05 04 01
0020   00 00 00 00 00 00 00 53 20 01 44 30 01 40 a6 35
0030   00 00 00 00 03 cd 09 f3 32 00 05 88 50 2a 54 16  -> ipv6 hdr with fragment hdr
0040   5d 6a a0 da 01 14 26 5a 85 ba 51 9f 17 75 04 9c  -> fragmented payload area. ipv6_defrag using this area as port

<TIME>          <Source>            <S.Port>    <Destination>               <D.Port>    <Proto> <Len>   <Info>
13:40:11.937654 2001:4430:5:401::53 7538        2001:4430:140:a635::3cd:9f3 6300        SIP/SDP 1480    Request: INVITE
0010   6b 80 00 00 05 90 2c 3d 20 01 44 30 00 05 04 01
0020   00 00 00 00 00 00 00 53 20 01 44 30 01 40 a6 35
0030   00 00 00 00 03 cd 09 f3 32 00 00 01 50 2a 54 16  -> ipv6 hdr with fragment hdr
0040   00 00 c4 c3 00 00 00 06 da 4d d7 26 a1 d7 64 c6  -> UDP hdr with the correct port value, but only this packet contains it.

# kernel log #
11-07 13:40:12.296 I[3:      swapper/3:    0] LNK-RX(1464): 6b 80 00 00 05 90 2c 3d 20 01 44 30 00 05 04 01 ...  --> our NIC log when pkt rcv
11-07 13:40:12.297 I[3:      swapper/3:    0] __skb_flow_dissect: ports: c3c40000                                --> right value for port
11-07 13:40:12.297 I[3:      swapper/3:    0] get_rps_cpu: cpu = 2 (hash:2758499534)
11-07 13:40:12.297 I[3:      swapper/3:    0] LNK-RX(380): 6b 80 00 00 01 54 2c 3d 20 01 44 30 00 05 04 01 ...
11-07 13:40:12.297 I[3:      swapper/3:    0] __skb_flow_dissect: ports: daa06a5d                                --> but at here...
11-07 13:40:12.298 I[3:      swapper/3:    0] get_rps_cpu: cpu = 1 (hash:791526712)                              --> so this pkt has different hash, cpu
11-07 13:40:12.298 I[2:      swapper/2:    0] ipv6_defrag+++
11-07 13:40:12.298 I[1:      swapper/1:    0] ipv6_defrag+++                                                     --> go into ipv6_defrag at same time
11-07 13:40:12.298 I[1:      swapper/1:    0] ipv6_defrag: EINPROGRESS
11-07 13:40:12.298 I[2:      swapper/2:    0] ipv6_defrag: EINPROGRESS                                           --> both packets were treated as the first frag



[exam 2]
# tcp dump #
<TIME>          <Source>            <S.Port>    <Destination>               <D.Port>    <Proto> <Len>   <Info>
13:40:13.947576 2001:4430:5:401::53             2001:4430:140:a635::3cd:9f3             IPv6    1480    IPv6 fragment
0010   6b 80 00 00 05 90 2c 3d 20 01 44 30 00 05 04 01
0020   00 00 00 00 00 00 00 53 20 01 44 30 01 40 a6 35
0030   00 00 00 00 03 cd 09 f3 32 00 00 01 50 2a 54 1c
0040   00 00 c4 c3 00 00 00 07 2e bb 86 7f 97 4f 58 7c

<TIME>          <Source>            <S.Port>    <Destination>               <D.Port>    <Proto> <Len>   <Info>
13:40:13.948379 2001:4430:5:401::53 7538        2001:4430:140:a635::3cd:9f3 6300        SIP/SDP 396     Request: INVITE
0010   6b 80 00 00 01 54 2c 3d 20 01 44 30 00 05 04 01
0020   00 00 00 00 00 00 00 53 20 01 44 30 01 40 a6 35
0030   00 00 00 00 03 cd 09 f3 32 00 05 88 50 2a 54 1c
0040   24 fd 06 d4 b9 23 5b c4 49 9e 1f 3e be f5 12 67

# kernel log #
11-07 13:40:14.306 I[3:      swapper/3:    0] LNK-RX(1464): 6b 80 00 00 05 90 2c 3d 20 01 44 30 00 05 04 01 ...
11-07 13:40:14.306 I[3:      swapper/3:    0] __skb_flow_dissect: ports: c3c40000
11-07 13:40:14.307 I[3:      swapper/3:    0] get_rps_cpu: cpu = 2 (hash:2758499534)
11-07 13:40:14.307 I[3:      swapper/3:    0] LNK-RX(380): 6b 80 00 00 01 54 2c 3d 20 01 44 30 00 05 04 01 ...
11-07 13:40:14.307 I[3:      swapper/3:    0] __skb_flow_dissect: ports: d406fd24
11-07 13:40:14.308 I[3:      swapper/3:    0] get_rps_cpu: cpu = 2 (hash:2624600197)                             --> different hash, but same cpu by unhash.
11-07 13:40:14.308 I[2:      swapper/2:    0] ipv6_defrag+++
11-07 13:40:14.308 I[2:      swapper/2:    0] ipv6_defrag: EINPROGRESS
11-07 13:40:14.308 I[2:      swapper/2:    0] ipv6_defrag+++
11-07 13:40:14.309 I[2:      swapper/2:    0] UDP: __udp_enqueue_schedule_skb: qlen=1                            --> delivered to upper layer


[exam 3]
# tcp dump #
<TIME>          <Source>            <S.Port>    <Destination>               <D.Port>    <Proto> <Len>   <Info>
13:40:05.514191 2001:4430:5:401::53             2001:4430:140:a635::3cd:9f3             IPv6    1480    IPv6 fragment
0010   6b 80 00 00 05 90 2c 3d 20 01 44 30 00 05 04 01
0020   00 00 00 00 00 00 00 53 20 01 44 30 01 40 a6 35
0030   00 00 00 00 03 cd 09 f3 32 00 00 01 50 2a 54 03
0040   00 00 c4 c3 00 00 00 03 27 e0 dd cc fe 77 a0 64

<TIME>          <Source>            <S.Port>    <Destination>               <D.Port>    <Proto> <Len>   <Info>
13:40:05.517187 2001:4430:5:401::53 7538        2001:4430:140:a635::3cd:9f3 6300        SIP/SDP 396     Request: INVITE
0010   6b 80 00 00 01 54 2c 3d 20 01 44 30 00 05 04 01
0020   00 00 00 00 00 00 00 53 20 01 44 30 01 40 a6 35
0030   00 00 00 00 03 cd 09 f3 32 00 05 88 50 2a 54 03
0040   b8 d2 31 bd b3 d7 89 d2 d3 07 99 36 28 3c 37 a4

# kernel log #
11-07 13:40:05.872 I[3:      swapper/3:    0] LNK-RX(1464): 6b 80 00 00 05 90 2c 3d 20 01 44 30 00 05 04 01 ...
11-07 13:40:05.874 I[3:      swapper/3:    0] __skb_flow_dissect: ports: c3c40000
11-07 13:40:05.875 I[3:      swapper/3:    0] get_rps_cpu: cpu = 2 (hash:2758499534)
11-07 13:40:05.876 I[3:      swapper/3:    0] LNK-RX(380): 6b 80 00 00 01 54 2c 3d 20 01 44 30 00 05 04 01 ...
11-07 13:40:05.878 I[3:      swapper/3:    0] __skb_flow_dissect: ports: bd31d2b8
11-07 13:40:05.878 I[3:      swapper/3:    0] get_rps_cpu: cpu = 3 (hash:3167003083)                             --> different cpu
11-07 13:40:05.879 I[2:      swapper/2:    0] ipv6_defrag+++
11-07 13:40:05.879 I[2:      swapper/2:    0] ipv6_defrag: EINPROGRESS
11-07 13:40:05.881 I[3:    ksoftirqd/3:   33] ipv6_defrag+++                                                     --> but successfully handled by ipv6_defrag
11-07 13:40:05.883 I[3:    ksoftirqd/3:   33] UDP: __udp_enqueue_schedule_skb: qlen=1                            --> delivered to upper layer

^ permalink raw reply related

* [PATCH net-next 0/7] net: sched: prepare for more Qdisc offloads
From: Jakub Kicinski @ 2018-11-08  1:33 UTC (permalink / raw)
  To: davem
  Cc: netdev, oss-drivers, jiri, xiyou.wangcong, jhs, nogah.frankel,
	yuvalm, Jakub Kicinski

Hi!

This series refactors the "switchdev" Qdisc offloads a little.  We have
a few Qdiscs which can be fully offloaded today to the forwarding plane
of switching devices.

The first patch adds a helper for handling statistic dumps; the code
seems to be copy-pasted between PRIO and RED.  The second patch removes
an unnecessary parameter from the RED offload function.  The third patch
makes the MQ offload use the dump helper, which helps it behave much
like PRIO and RED when it comes to the TCQ_F_OFFLOADED flag.  Patch 4
adds a graft helper, similar to the dump helper.

Patch 5 is unrelated to offloads, qdisc_graft() code seemed ripe for a
small refactor - no functional changes there.

The last two patches move the qdisc_put() call outside of the
sch_tree_lock section for RED and PRIO.  The child Qdiscs will get
removed from the hierarchy under the lock, but having the put (and
potentially destroy) called outside of the lock helps offloads, which
may choose to sleep, and it should generally lower the Qdisc change impact.
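
The "unlink under the lock, put after unlocking" pattern can be sketched in user space as follows. This is only an illustration under stated assumptions, not the code from the patches: toy_parent, toy_child, toy_lock, toy_unlock, toy_put and toy_replace_child are hypothetical names, and a plain flag stands in for sch_tree_lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins: a flag models sch_tree_lock, toy_child a Qdisc. */
struct toy_child {
	int id;
};

struct toy_parent {
	bool tree_locked;
	struct toy_child *child;
	int destroyed;	/* counts destroy calls, for illustration only */
};

static void toy_lock(struct toy_parent *p)
{
	assert(!p->tree_locked);
	p->tree_locked = true;
}

static void toy_unlock(struct toy_parent *p)
{
	assert(p->tree_locked);
	p->tree_locked = false;
}

/* Models a put that may sleep: it must not run with the lock held. */
static void toy_put(struct toy_parent *p, struct toy_child *c)
{
	assert(!p->tree_locked);
	if (c) {
		free(c);
		p->destroyed++;
	}
}

/* Swap the child under the lock, release the old one afterwards. */
static void toy_replace_child(struct toy_parent *p,
			      struct toy_child *new_child)
{
	struct toy_child *old;

	toy_lock(p);
	old = p->child;		/* unlink from the hierarchy under the lock */
	p->child = new_child;
	toy_unlock(p);

	toy_put(p, old);	/* put (and possibly destroy) outside the lock */
}
```

The assert in toy_put() makes the constraint explicit: the release path is free to block because no lock is held when it runs, while the pointer swap itself stays inside the critical section.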

Jakub Kicinski (7):
  net: sched: add an offload dump helper
  net: sched: red: remove unnecessary red_dump_offload_stats parameter
  net: sched: set TCQ_F_OFFLOADED flag for MQ
  net: sched: add an offload graft helper
  net: sched: refactor grafting Qdiscs with a parent
  net: sched: red: delay destroying child qdisc on replace
  net: sched: prio: delay destroying child qdiscs on change

 include/net/sch_generic.h | 24 ++++++++++++
 net/sched/sch_api.c       | 78 ++++++++++++++++++++++++++++++++-------
 net/sched/sch_mq.c        |  9 ++---
 net/sched/sch_prio.c      | 47 ++++-------------------
 net/sched/sch_red.c       | 29 +++++----------
 5 files changed, 107 insertions(+), 80 deletions(-)

-- 
2.17.1

^ permalink raw reply

* [PATCH net-next 1/7] net: sched: add an offload dump helper
From: Jakub Kicinski @ 2018-11-08  1:33 UTC (permalink / raw)
  To: davem
  Cc: netdev, oss-drivers, jiri, xiyou.wangcong, jhs, nogah.frankel,
	yuvalm, Jakub Kicinski
In-Reply-To: <20181108013340.20983-1-jakub.kicinski@netronome.com>

Qdisc dump operation of offload-capable qdiscs performs a few
extra steps which are identical among all the qdiscs.  Add
a helper to share this code.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
---
 include/net/sch_generic.h | 12 ++++++++++++
 net/sched/sch_api.c       | 21 +++++++++++++++++++++
 net/sched/sch_prio.c      | 16 +---------------
 net/sched/sch_red.c       | 17 +----------------
 4 files changed, 35 insertions(+), 31 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 4d736427a4cb..af55c1c4edb1 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -579,6 +579,18 @@ void qdisc_put(struct Qdisc *qdisc);
 void qdisc_put_unlocked(struct Qdisc *qdisc);
 void qdisc_tree_reduce_backlog(struct Qdisc *qdisc, unsigned int n,
 			       unsigned int len);
+#ifdef CONFIG_NET_SCHED
+int qdisc_offload_dump_helper(struct Qdisc *q, enum tc_setup_type type,
+			      void *type_data);
+#else
+static inline int
+qdisc_offload_dump_helper(struct Qdisc *q, enum tc_setup_type type,
+			  void *type_data)
+{
+	q->flags &= ~TCQ_F_OFFLOADED;
+	return 0;
+}
+#endif
 struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 			  const struct Qdisc_ops *ops,
 			  struct netlink_ext_ack *extack);
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index ca3b0f46de53..e534825d3d3a 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -810,6 +810,27 @@ void qdisc_tree_reduce_backlog(struct Qdisc *sch, unsigned int n,
 }
 EXPORT_SYMBOL(qdisc_tree_reduce_backlog);
 
+int qdisc_offload_dump_helper(struct Qdisc *sch, enum tc_setup_type type,
+			      void *type_data)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	int err;
+
+	sch->flags &= ~TCQ_F_OFFLOADED;
+	if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
+		return 0;
+
+	err = dev->netdev_ops->ndo_setup_tc(dev, type, type_data);
+	if (err == -EOPNOTSUPP)
+		return 0;
+
+	if (!err)
+		sch->flags |= TCQ_F_OFFLOADED;
+
+	return err;
+}
+EXPORT_SYMBOL(qdisc_offload_dump_helper);
+
 static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc *q, u32 clid,
 			 u32 portid, u32 seq, u16 flags, int event)
 {
diff --git a/net/sched/sch_prio.c b/net/sched/sch_prio.c
index f8af98621179..4bdd04c30ead 100644
--- a/net/sched/sch_prio.c
+++ b/net/sched/sch_prio.c
@@ -251,7 +251,6 @@ static int prio_init(struct Qdisc *sch, struct nlattr *opt,
 
 static int prio_dump_offload(struct Qdisc *sch)
 {
-	struct net_device *dev = qdisc_dev(sch);
 	struct tc_prio_qopt_offload hw_stats = {
 		.command = TC_PRIO_STATS,
 		.handle = sch->handle,
@@ -263,21 +262,8 @@ static int prio_dump_offload(struct Qdisc *sch)
 			},
 		},
 	};
-	int err;
-
-	sch->flags &= ~TCQ_F_OFFLOADED;
-	if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
-		return 0;
-
-	err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_PRIO,
-					    &hw_stats);
-	if (err == -EOPNOTSUPP)
-		return 0;
-
-	if (!err)
-		sch->flags |= TCQ_F_OFFLOADED;
 
-	return err;
+	return qdisc_offload_dump_helper(sch, TC_SETUP_QDISC_PRIO, &hw_stats);
 }
 
 static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index 3ce6c0a2c493..d5e441194397 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -281,7 +281,6 @@ static int red_init(struct Qdisc *sch, struct nlattr *opt,
 
 static int red_dump_offload_stats(struct Qdisc *sch, struct tc_red_qopt *opt)
 {
-	struct net_device *dev = qdisc_dev(sch);
 	struct tc_red_qopt_offload hw_stats = {
 		.command = TC_RED_STATS,
 		.handle = sch->handle,
@@ -291,22 +290,8 @@ static int red_dump_offload_stats(struct Qdisc *sch, struct tc_red_qopt *opt)
 			.stats.qstats = &sch->qstats,
 		},
 	};
-	int err;
-
-	sch->flags &= ~TCQ_F_OFFLOADED;
-
-	if (!tc_can_offload(dev) || !dev->netdev_ops->ndo_setup_tc)
-		return 0;
-
-	err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_QDISC_RED,
-					    &hw_stats);
-	if (err == -EOPNOTSUPP)
-		return 0;
-
-	if (!err)
-		sch->flags |= TCQ_F_OFFLOADED;
 
-	return err;
+	return qdisc_offload_dump_helper(sch, TC_SETUP_QDISC_RED, &hw_stats);
 }
 
 static int red_dump(struct Qdisc *sch, struct sk_buff *skb)
-- 
2.17.1
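
The return-value and flag semantics of the helper can be exercised with a small user-space model. This is a sketch under stated assumptions rather than the kernel function: toy_qdisc, TOY_F_OFFLOADED and the toy_drv_* callbacks are hypothetical, the tc_can_offload() check is left out, and a NULL callback models a device without ndo_setup_tc.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_F_OFFLOADED 0x1u	/* hypothetical flag bit, not the kernel value */

struct toy_qdisc {
	unsigned int flags;
	/* Hypothetical driver callback; NULL models a missing ndo_setup_tc. */
	int (*setup_tc)(void *type_data);
};

/* Mirrors the control flow of the helper in the patch above. */
static int toy_offload_dump_helper(struct toy_qdisc *sch, void *type_data)
{
	int err;

	assert(sch);
	sch->flags &= ~TOY_F_OFFLOADED;
	if (!sch->setup_tc)
		return 0;

	err = sch->setup_tc(type_data);
	if (err == -EOPNOTSUPP)
		return 0;

	if (!err)
		sch->flags |= TOY_F_OFFLOADED;

	return err;
}

/* Three hypothetical drivers: success, unsupported, hard failure. */
static int toy_drv_ok(void *type_data)
{
	(void)type_data;
	return 0;
}

static int toy_drv_unsupported(void *type_data)
{
	(void)type_data;
	return -EOPNOTSUPP;
}

static int toy_drv_fail(void *type_data)
{
	(void)type_data;
	return -EINVAL;
}
```

In this model only a successful driver call sets the flag. A missing callback and -EOPNOTSUPP both return 0 with the flag cleared, and any other error is passed back to the caller with the flag cleared.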

^ permalink raw reply related

