Netdev List


* Re: [PATCHv2 net-next 1/6] sctp: add sctp_info dump api for sctp_diag
From: Eric Dumazet @ 2016-04-09 17:31 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, Vlad Yasevich,
	daniel, davem
In-Reply-To: <CADvbK_dzvjT=pFYP8uRDa8naNWUp+dkUHSVXP+Y8P0U3=nQM8g@mail.gmail.com>

On Sun, 2016-04-10 at 00:10 +0800, Xin Long wrote:

> 2. sctp_diag_dump_one -> sctp_transport_lookup_process -> sctp_tsp_dump_one
> this one just holds the tsp, and we're using list_for_each_safe here now,
> isn't it enough ?

list_for_each_safe is 'safe' if you do a list_del() yourself.

It is not safe if other cpus are adding/deleting items in the list while
this thread is iterating it.
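
As a rough illustration of that distinction, here is a userspace sketch (a stand-in under stated assumptions, not kernel code). The list type and macro mimic the kernel's list_head and list_for_each_safe; struct node, list_add_tail_node(), list_del_node(), remove_matching() and list_count() are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

struct node {
	struct list_head link;	/* kept first so the cast in remove_matching() is valid */
	int value;
};

/* Same idea as the kernel macro: save the successor before the loop body runs. */
#define list_for_each_safe(pos, n, head) \
	for (pos = (head)->next, n = pos->next; pos != (head); pos = n, n = pos->next)

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add_tail_node(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

static void list_del_node(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = entry->prev = NULL;	/* a stale walker would trip over this */
}

/*
 * Deleting the *current* entry is fine because 'n' was saved beforehand.
 * Nothing here protects against another thread unlinking 'n' itself.
 */
static int remove_matching(struct list_head *head, int value)
{
	struct list_head *pos, *n;
	int removed = 0;

	list_for_each_safe(pos, n, head) {
		if (((struct node *)pos)->value == value) {
			list_del_node(pos);
			removed++;
		}
	}
	return removed;
}

static int list_count(struct list_head *head)
{
	struct list_head *pos, *n;
	int count = 0;

	list_for_each_safe(pos, n, head)
		count++;
	return count;
}
```

The saved 'n' pointer is the whole trick, and it only helps when the iterating thread itself deletes 'pos'. If another CPU unlinks 'n' between two iterations, the walker follows a stale pointer, so the macro alone does not make the unlocked path safe.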

* Re: [PATCHv2 net-next 1/6] sctp: add sctp_info dump api for sctp_diag
From: Eric Dumazet @ 2016-04-09 17:29 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, Vlad Yasevich,
	daniel, davem
In-Reply-To: <CADvbK_dzvjT=pFYP8uRDa8naNWUp+dkUHSVXP+Y8P0U3=nQM8g@mail.gmail.com>

On Sun, 2016-04-10 at 00:10 +0800, Xin Long wrote:
> On Sat, Apr 9, 2016 at 1:19 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Sat, 2016-04-09 at 12:53 +0800, Xin Long wrote:
> >> sctp_diag will dump some important details of sctp's assoc or ep, we use
> >> sctp_info to describe them,  sctp_get_sctp_info to get them, and export
> >> it to sctp_diag.ko.
> >>
> >
> >
> >> +int sctp_get_sctp_info(struct sock *sk, struct sctp_association *asoc,
> >> +                    struct sctp_info *info)
> >> +{
> >> +     struct sctp_transport *prim;
> >> +     struct list_head *pos, *temp;
> >> +     int mask;
> >> +
> >> +     memset(info, 0, sizeof(*info));
> >> +     if (!asoc) {
> >> +             struct sctp_sock *sp = sctp_sk(sk);
> >> +
> >> +             info->sctpi_s_autoclose = sp->autoclose;
> >> +             info->sctpi_s_adaptation_ind = sp->adaptation_ind;
> >> +             info->sctpi_s_pd_point = sp->pd_point;
> >> +             info->sctpi_s_nodelay = sp->nodelay;
> >> +             info->sctpi_s_disable_fragments = sp->disable_fragments;
> >> +             info->sctpi_s_v4mapped = sp->v4mapped;
> >> +             info->sctpi_s_frag_interleave = sp->frag_interleave;
> >> +
> >> +             return 0;
> >> +     }
> >> +
> >> +     info->sctpi_tag = asoc->c.my_vtag;
> >> +     info->sctpi_state = asoc->state;
> >> +     info->sctpi_rwnd = asoc->a_rwnd;
> >> +     info->sctpi_unackdata = asoc->unack_data;
> >> +     info->sctpi_penddata = sctp_tsnmap_pending(&asoc->peer.tsn_map);
> >> +     info->sctpi_instrms = asoc->c.sinit_max_instreams;
> >> +     info->sctpi_outstrms = asoc->c.sinit_num_ostreams;
> >> +     list_for_each_safe(pos, temp, &asoc->base.inqueue.in_chunk_list)
> >> +             info->sctpi_inqueue++;
> >> +     list_for_each_safe(pos, temp, &asoc->outqueue.out_chunk_list)
> >> +             info->sctpi_outqueue++;
> >
> > Is this safe ?
> >
> > Do you own the lock on socket or whatever lock protecting this list ?
> >
> There are 2 places that will call this code:
> 1. sctp_diag_dump -> sctp_for_each_transport -> sctp_tsp_dump
> This one uses lock_sock to protect them. I think this one is ok.
> 
> 2. sctp_diag_dump_one -> sctp_transport_lookup_process -> sctp_tsp_dump_one
> This one just holds the tsp, and we're using list_for_each_safe here now.
> Isn't that enough ?
> 

You tell me ;)

For sure in tcp_get_info() the socket is sometimes locked, and sometimes
is not locked, depending on the caller.

So we had to use only lockless accesses.
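
One way to picture a lockless access for a queue length is a running counter that the dump path reads with a single load, instead of walking the list. The sketch below is a userspace analogy using C11 atomics, not the kernel's own primitives and not what tcp_get_info() literally does; struct queue_stats and its helpers are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-association counter, maintained by whoever owns the queue. */
struct queue_stats {
	atomic_uint inqueue_len;
};

/* Writer side: called when a chunk is queued. */
static void queue_stats_enqueue(struct queue_stats *qs)
{
	atomic_fetch_add_explicit(&qs->inqueue_len, 1, memory_order_relaxed);
}

/* Writer side: called when a chunk is dequeued. */
static void queue_stats_dequeue(struct queue_stats *qs)
{
	atomic_fetch_sub_explicit(&qs->inqueue_len, 1, memory_order_relaxed);
}

/* Reader side: one load, no list walk, no lock. The value may be slightly stale. */
static unsigned int queue_stats_snapshot(struct queue_stats *qs)
{
	return atomic_load_explicit(&qs->inqueue_len, memory_order_relaxed);
}
```

A diag dump only needs an approximate number, so a possibly stale snapshot is an acceptable trade for not depending on the caller's locking state.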

* Re: [RFC PATCH v2 5/5] Add sample for adding simple drop program to link
From: Jamal Hadi Salim @ 2016-04-09 17:27 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel, brouer,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo
In-Reply-To: <20160409164308.GA5750@gmail.com>

On 16-04-09 12:43 PM, Brenden Blanco wrote:
> On Sat, Apr 09, 2016 at 10:48:05AM -0400, Jamal Hadi Salim wrote:

>> Ok, sorry - should have looked this far before sending earlier email.
>> So when you run concurrently you see about 5Mpps per core but if you
>> shoot all traffic at a single core you see 20Mpps?
> No, only sender is multiple, receiver is still single core. The flow is
> the same in all 4 of the send threads. Note that only ksoftirqd/6 is
> active.

Got it.
So the sender was limited to 20Mpps and the receiver is able to keep up,
if I understand correctly.

>>
>> Devil's advocate question:
>> If the bottleneck is the driver - is there an advantage in adding the
>> bpf code at all in the driver?
> Only by adding this hook into the driver has it become the bottleneck.
>
> Prior to this, the bottleneck was later in the codepath, primarily in
> allocations.
>

It may be useful to show the before and after in your commit log.
Looking at both your profile and Daniel's, as shown in this email,
mlx4_en_process_rx_cq() seems to be where the action is in both, no?

> If a packet is to be dropped, and a determination can be made with fewer
> cpu cycles spent, then there is more time for the goodput.
>

Agreed.

> Beyond that, even if the skb allocation gets 10x or 100x or whatever
> improvement, there is still a non-zero cost associated, and dropping bad
> packets with minimal time spent has value. The same argument holds for
> physical nic forwarding decisions.
>

I always go for the lowest hanging fruit.
It seemed it was the driver path in your case. When we removed
the driver overhead (as demoed at the tc workshop in netdev11) we saw
__netif_receive_skb_core() at the top of the profile.
So in this case it seems it was mlx4_en_process_rx_cq() - that's why I
was saying the bottleneck is the driver.
Having said that: I agree that early drop is useful, if for nothing
else than to avoid the longer code path. (I was worried, after reading
the thread, that this was going to turn into a messy stack-in-the-driver,
and I am not sure that is avoidable either, given that a new ops
interface is showing up.)

>> I am more curious than before to see the comparison for the same bpf code
>> running at tc level vs in the driver..
> Here is a perf report for drop in the clsact qdisc with direct-action,
> which Daniel earlier showed to have the best performance to-date. On my
> machine, this gets about 6.5Mpps drop single core. Drop due to failed
> IP lookup (not shown here) is worse @4.5Mpps.
>

Nice.
However, for this to be an orange/orange comparison you still have to
run it on the _same receiver machine_, as opposed to Daniel doing
it on his for the one case, and with two different kernels booted up:
one patched with your changes and another virgin without them.

cheers,
jamal

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Alexei Starovoitov @ 2016-04-09 17:26 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Tom Herbert, Jesper Dangaard Brouer, Brenden Blanco,
	David S. Miller, Linux Kernel Network Developers, Or Gerlitz,
	Daniel Borkmann, Eric Dumazet, Edward Cree, john fastabend,
	Thomas Graf, Johannes Berg, eranlinuxmellanox, Lorenzo Colitti,
	linux-mm
In-Reply-To: <57091FCE.50104@mojatatu.com>

On Sat, Apr 09, 2016 at 11:29:18AM -0400, Jamal Hadi Salim wrote:
> On 16-04-09 07:29 AM, Tom Herbert wrote:
> 
> >+1. Forwarding which will be a common application almost always
> >requires modification (decrement TTL), and header data split has
> >always been a weak feature since the device has to have some arbitrary
> >rules about what headers needs to be split out (either implements
> >protocol specific parsing or some fixed length).
> 
> Then this is sensible. I was cruising the threads and
> confused by your earlier emails Tom because you talked
> about XPS etc. It sounded like the idea evolved into putting
> the whole freaking stack on bpf.

yeah, no stack, no queues in bpf.

> If this is _forwarding only_ it may be useful to look at
> Alexey's old code in particular the DMA bits;
> he built his own lookup algorithm but sounds like bpf is
> a much better fit today.

a link to these old bits?

Just to be clear: this rfc is not the only thing we're considering.
In particular, the Huawei guys made a monster effort to improve performance
in this area as well. We'll try to blend all the code together and
pick what's the best.

* Re: [PATCH net-next] net: bcmgenet: use __napi_schedule_irqoff()
From: Petri Gynther @ 2016-04-09 17:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Fainelli, David Miller, netdev, opendmb
In-Reply-To: <1460179856.6473.482.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, Apr 8, 2016 at 10:30 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Florian Fainelli <f.fainelli@gmail.com>
>
> bcmgenet_isr1() and bcmgenet_isr0() run in hard irq context,
> we do not need to block irq again.
>
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Acked-by: Petri Gynther <pgynther@google.com>

> ---
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index f7b42b9fc979..4367d561a12e 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -2493,7 +2493,7 @@ static irqreturn_t bcmgenet_isr1(int irq, void *dev_id)
>
>                 if (likely(napi_schedule_prep(&rx_ring->napi))) {
>                         rx_ring->int_disable(rx_ring);
> -                       __napi_schedule(&rx_ring->napi);
> +                       __napi_schedule_irqoff(&rx_ring->napi);
>                 }
>         }
>
> @@ -2506,7 +2506,7 @@ static irqreturn_t bcmgenet_isr1(int irq, void *dev_id)
>
>                 if (likely(napi_schedule_prep(&tx_ring->napi))) {
>                         tx_ring->int_disable(tx_ring);
> -                       __napi_schedule(&tx_ring->napi);
> +                       __napi_schedule_irqoff(&tx_ring->napi);
>                 }
>         }
>
> @@ -2536,7 +2536,7 @@ static irqreturn_t bcmgenet_isr0(int irq, void *dev_id)
>
>                 if (likely(napi_schedule_prep(&rx_ring->napi))) {
>                         rx_ring->int_disable(rx_ring);
> -                       __napi_schedule(&rx_ring->napi);
> +                       __napi_schedule_irqoff(&rx_ring->napi);
>                 }
>         }
>
> @@ -2545,7 +2545,7 @@ static irqreturn_t bcmgenet_isr0(int irq, void *dev_id)
>
>                 if (likely(napi_schedule_prep(&tx_ring->napi))) {
>                         tx_ring->int_disable(tx_ring);
> -                       __napi_schedule(&tx_ring->napi);
> +                       __napi_schedule_irqoff(&tx_ring->napi);
>                 }
>         }
>
>
>

* Re: [PATCHv2 net-next 4/6] sctp: add the sctp_diag.c file
From: Eric Dumazet @ 2016-04-09 17:23 UTC (permalink / raw)
  To: Xin Long
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, Vlad Yasevich,
	daniel, davem
In-Reply-To: <CADvbK_dZDDCpEm-RF_Zv3Xq6kvnfGnnk822cON5rUJPo-R8Y4Q@mail.gmail.com>

On Sat, 2016-04-09 at 23:40 +0800, Xin Long wrote:

> you meant we can remove it here ?
> yes, it seems similar to INET_DIAG_SKMEMINFO.
> but I do not know if userspace may use INET_DIAG_MEMINFO now.
> 

You are adding new features here, so there is no legacy code problem.

Anyway, as I said, you really need to reuse code, not copy/paste it, please.

* Re: [PATCH net-next] net: bcmgenet: use napi_complete_done()
From: Petri Gynther @ 2016-04-09 17:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Florian Fainelli
In-Reply-To: <1460178400.6473.469.camel@edumazet-glaptop3.roam.corp.google.com>

On Fri, Apr 8, 2016 at 10:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> By using napi_complete_done(), we allow fine tuning
> of /sys/class/net/ethX/gro_flush_timeout for higher GRO aggregation
> efficiency for a Gbit NIC.
>
> Check commit 24d2e4a50737 ("tg3: use napi_complete_done()") for details.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Petri Gynther <pgynther@google.com>
> Cc: Florian Fainelli <f.fainelli@gmail.com>

Acked-by: Petri Gynther <pgynther@google.com>

> ---
>  drivers/net/ethernet/broadcom/genet/bcmgenet.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/genet/bcmgenet.c b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> index f7b42b9fc979..e823013d3125 100644
> --- a/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> +++ b/drivers/net/ethernet/broadcom/genet/bcmgenet.c
> @@ -1735,7 +1735,7 @@ static int bcmgenet_rx_poll(struct napi_struct *napi, int budget)
>         work_done = bcmgenet_desc_rx(ring, budget);
>
>         if (work_done < budget) {
> -               napi_complete(napi);
> +               napi_complete_done(napi, work_done);
>                 ring->int_enable(ring);
>         }
>
>
>

* Re: [PATCHv2 net-next 1/6] sctp: add sctp_info dump api for sctp_diag
From: Eric Dumazet @ 2016-04-09 17:21 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Xin Long, network dev, linux-sctp, Marcelo Ricardo Leitner,
	Vlad Yasevich, daniel, davem
In-Reply-To: <57091D90.4030905@mojatatu.com>

On Sat, 2016-04-09 at 11:19 -0400, Jamal Hadi Salim wrote:
> On 16-04-09 01:16 AM, Eric Dumazet wrote:
> 
> >
> > Lots of holes in this structure...
> >
> >
> 
> 
> I may have mentioned to you that there is an 8 bit hole in tcp_info too ;->
> (above tcpi_rto). Adding an explicit 8 bit pad seems useful
> since it is already in the wild. I was going to send the patch after
> netdev11 but forgot.

Well, once a hole is there, nothing we can do really, because of
compatibility with old kernels / old binaries.


But when a _new_ structure is defined, this is the time where we can ask
for doing sensible things ;)
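
To make the "holes" concrete, here is a small sketch with two made-up layouts (neither is the real sctp_info or tcp_info): one that leaves the padding to the compiler, and one that orders the fields by size and names the leftover byte explicitly. The struct and function names are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with implicit holes. */
struct info_holey {
	uint8_t  state;
	/* the compiler pads here so that 'rwnd' is naturally aligned */
	uint32_t rwnd;
	uint16_t streams;
	/* and typically adds tail padding here */
};

/* Same fields, largest first, with the remaining byte made explicit. */
struct info_padded {
	uint32_t rwnd;
	uint16_t streams;
	uint8_t  state;
	uint8_t  pad0;		/* explicit pad: zeroed now, reusable later */
};

/* Padding bytes = total size minus the sum of the named members. */
static size_t holey_padding(void)
{
	return sizeof(struct info_holey) -
	       (sizeof(uint8_t) + sizeof(uint32_t) + sizeof(uint16_t));
}

static size_t padded_padding(void)
{
	return sizeof(struct info_padded) -
	       (sizeof(uint32_t) + sizeof(uint16_t) + 2 * sizeof(uint8_t));
}
```

On common ABIs holey_padding() is non-zero and padded_padding() is zero. Once a layout like the first one has shipped to userspace, the hole positions are part of the ABI, which is why the time to get this right is when the structure is new.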

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Alexei Starovoitov @ 2016-04-09 17:00 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Brenden Blanco, David S. Miller, Linux Kernel Network Developers,
	Or Gerlitz, Daniel Borkmann, Jesper Dangaard Brouer, Eric Dumazet,
	Edward Cree, john fastabend, Thomas Graf, Johannes Berg,
	eranlinuxmellanox, Lorenzo Colitti
In-Reply-To: <CALx6S34m8cVNgvuGp845bicixodfavH9cj-rARSwwEAvFCjd7g@mail.gmail.com>

On Sat, Apr 09, 2016 at 08:17:04AM -0300, Tom Herbert wrote:
> >
> > +/* user return codes for PHYS_DEV prog type */
> > +enum bpf_phys_dev_action {
> > +       BPF_PHYS_DEV_DROP,
> > +       BPF_PHYS_DEV_OK,
> 
> I don't like OK. Maybe this means LOCAL. We also need FORWARD (not sure
> how to handle GRO yet).
> 
> I would suggest that we format the return code as code:subcode, where
> the above are codes. subcode is relevant to major code. For instance
> in forwarding the subcodes indicate a forwarding instruction (maybe a
> queue). DROP subcodes might answer why.

for tc redirect we use hidden percpu variable to pass additional
info together with return code. The cost of it is extra bpf_redirect() call.
Here we can do better and embed such info for xmit,
but subcodes for drop are a slippery slope, since they add concepts
to design that are not going to be used by everyone.
If necessary bpf programs can count drop reasons internally.
Drops due to protocol!=ipv6 or drops due to ip frags present
will be program internal reasons. No need to expose them in api.

We need to get xmit part implemented first and see how it looks
before deciding on this part of api.
Right now I think we do not need tx queue number actually.
The prog should just return 'XMIT' and xdp(driver) side will decide
which tx queue to use.
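
For readers following the code:subcode idea quoted above, a packed return value could look roughly like the sketch below. Everything here is hypothetical: only DROP and OK exist in the RFC, PHYS_DEV_XMIT is an assumed name for the forwarding action, and the 8/24 bit split is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical action codes; only DROP and OK appear in the RFC. */
enum phys_dev_code {
	PHYS_DEV_DROP = 0,
	PHYS_DEV_OK   = 1,
	PHYS_DEV_XMIT = 2,	/* assumed name for the forwarding action */
};

#define PHYS_DEV_CODE_BITS	8u
#define PHYS_DEV_CODE_MASK	((1u << PHYS_DEV_CODE_BITS) - 1u)

/* Pack: the low 8 bits carry the code, the upper 24 bits a code-specific subcode. */
static uint32_t phys_dev_ret(enum phys_dev_code code, uint32_t subcode)
{
	return (subcode << PHYS_DEV_CODE_BITS) |
	       ((uint32_t)code & PHYS_DEV_CODE_MASK);
}

/* Unpack the major code. */
static enum phys_dev_code phys_dev_ret_code(uint32_t ret)
{
	return (enum phys_dev_code)(ret & PHYS_DEV_CODE_MASK);
}

/* Unpack the subcode. */
static uint32_t phys_dev_ret_subcode(uint32_t ret)
{
	return ret >> PHYS_DEV_CODE_BITS;
}
```

With this encoding a program that returns a bare code (subcode 0) yields the same value as the plain enum, so the subcode stays optional. That is compatible with the position taken here that drop reasons can remain program-internal.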

> One other API issue is how to deal with encapsulation. In this case a
> header may be prepended to the packet, I assume there are BPF helper
> functions and we don't need to return a new length or start?

a bit of history:
for tc+bpf we've been trying to come up with clean helpers to do
header push/pop and it was very difficult, since skb keeps a ton
of metadata about header offsets, csum offsets, encap flag, etc.
We've lost count of the number of different approaches we've
implemented and discarded.
For XDP there is no such issue.
Likely we'll have single bpf_packet_change(ctx, off, len) helper
that will grow(len) or trim(-len) bytes at offset(off) in the packet.
ctx->len will be adjusted automatically by the helper.
The headroom, tailroom will not be exposed and will never be known
to the bpf side. It's up to the helper and the driver to decide how to
insert N bytes at offset M. If the driver reserved headroom in dma
buffer, it can grow into it, if not it can grow tail and move
the whole packet. For performance reasons we obviously want some
headroom in dma buffer, but it's not exposed to bpf.
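
A userspace sketch of how such a grow/trim helper might behave is shown below. It is not the proposed bpf_packet_change() itself: struct pkt_buf, pkt_change() and the fixed 256-byte buffer are assumptions made up for the example, and zero-filling the inserted bytes is a choice of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PKT_BUF_SIZE 256

/* Hypothetical driver-side view of one receive buffer. */
struct pkt_buf {
	unsigned char buf[PKT_BUF_SIZE];
	size_t data_off;	/* start of packet within buf (the headroom) */
	size_t len;		/* packet length, the only thing a program would see */
};

/*
 * Insert (delta > 0) or remove (delta < 0) |delta| bytes at packet
 * offset 'off'. Returns 0 on success, -1 if the request does not fit.
 */
static int pkt_change(struct pkt_buf *p, size_t off, long delta)
{
	unsigned char *data = p->buf + p->data_off;

	if (off > p->len)
		return -1;

	if (delta > 0) {
		size_t n = (size_t)delta;

		if (n <= p->data_off) {
			/* headroom available: slide the first 'off' bytes backwards */
			memmove(data - n, data, off);
			p->data_off -= n;
		} else if (p->data_off + p->len + n <= PKT_BUF_SIZE) {
			/* no headroom: push everything after 'off' towards the tail */
			memmove(data + off + n, data + off, p->len - off);
		} else {
			return -1;
		}
		/* zero the newly opened gap */
		memset(p->buf + p->data_off + off, 0, n);
		p->len += n;
	} else if (delta < 0) {
		size_t n = (size_t)(-delta);

		if (n > p->len - off)
			return -1;
		/* close the gap by pulling the tail forward */
		memmove(data + off, data + off + n, p->len - off - n);
		p->len -= n;
	}
	return 0;
}
```

The caller only ever passes an offset and a signed length; whether the bytes come out of headroom or tailroom is decided inside the helper, which matches the intent that headroom is never exposed to the bpf side.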

But it could be that directly adjusting ctx->len and ctx->data is faster.
For cls_bpf ctx->data is hidden and packet access is done via
special instructions and helpers. For XDP we can hopefully do better
and do packet access with direct loads. I outlined that plan
in the previous thread.

* Re: [RFC PATCH v2 5/5] Add sample for adding simple drop program to link
From: Brenden Blanco @ 2016-04-09 16:43 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: davem, netdev, tom, alexei.starovoitov, ogerlitz, daniel, brouer,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo
In-Reply-To: <57091625.1010206@mojatatu.com>

On Sat, Apr 09, 2016 at 10:48:05AM -0400, Jamal Hadi Salim wrote:
> On 16-04-08 12:48 AM, Brenden Blanco wrote:
> >Add a sample program that only drops packets at the
> >BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program,
> >observed single core rate is ~19.5Mpps.
> >
> >Other tests were run, for instance without the dropcnt increment or
> >without reading from the packet header, the packet rate was mostly
> >unchanged.
> >
> >$ perf record -a samples/bpf/netdrvx1 $(</sys/class/net/eth0/ifindex)
> >proto 17:   19596362 drops/s
> >
> >./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> >Running... ctrl^C to stop
> >Device: eth4@0
> >Result: OK: 7873817(c7872245+d1572) usec, 38801823 (60byte,0frags)
> >   4927955pps 2365Mb/sec (2365418400bps) errors: 0
> >Device: eth4@1
> >Result: OK: 7873817(c7872123+d1693) usec, 38587342 (60byte,0frags)
> >   4900715pps 2352Mb/sec (2352343200bps) errors: 0
> >Device: eth4@2
> >Result: OK: 7873817(c7870929+d2888) usec, 38718848 (60byte,0frags)
> >   4917417pps 2360Mb/sec (2360360160bps) errors: 0
> >Device: eth4@3
> >Result: OK: 7873818(c7872193+d1625) usec, 38796346 (60byte,0frags)
> >   4927259pps 2365Mb/sec (2365084320bps) errors: 0
> >
> >perf report --no-children:
> >  29.48%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_process_rx_cq
> >  18.17%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_alloc_frags
> >   8.19%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_free_frag
> >   5.35%  ksoftirqd/6  [kernel.vmlinux]  [k] get_page_from_freelist
> >   2.92%  ksoftirqd/6  [kernel.vmlinux]  [k] free_pages_prepare
> >   2.90%  ksoftirqd/6  [mlx4_en]         [k] mlx4_call_bpf
> >   2.72%  ksoftirqd/6  [fjes]            [k] 0x000000000000af66
> >   2.37%  ksoftirqd/6  [kernel.vmlinux]  [k] swiotlb_sync_single_for_cpu
> >   1.92%  ksoftirqd/6  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
> >   1.83%  ksoftirqd/6  [kernel.vmlinux]  [k] free_one_page
> >   1.70%  ksoftirqd/6  [kernel.vmlinux]  [k] swiotlb_sync_single
> >   1.69%  ksoftirqd/6  [kernel.vmlinux]  [k] bpf_map_lookup_elem
> >   1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
> >   1.32%  ksoftirqd/6  [fjes]            [k] 0x000000000000af90
> >   1.21%  ksoftirqd/6  [kernel.vmlinux]  [k] sk_load_byte_positive_offset
> >   1.07%  ksoftirqd/6  [kernel.vmlinux]  [k] __alloc_pages_nodemask
> >   0.89%  ksoftirqd/6  [kernel.vmlinux]  [k] __rmqueue
> >   0.84%  ksoftirqd/6  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
> >   0.79%  ksoftirqd/6  [kernel.vmlinux]  [k] net_rx_action
> >
> >machine specs:
> >  receiver - Intel E5-1630 v3 @ 3.70GHz
> >  sender - Intel E5645 @ 2.40GHz
> >  Mellanox ConnectX-3 @40G
> >
> 
> 
> Ok, sorry - should have looked this far before sending earlier email.
> So when you run concurrently you see about 5Mpps per core but if you
> shoot all traffic at a single core you see 20Mpps?
No, only sender is multiple, receiver is still single core. The flow is
the same in all 4 of the send threads. Note that only ksoftirqd/6 is
active.
> 
> Devil's advocate question:
> If the bottleneck is the driver - is there an advantage in adding the
> bpf code at all in the driver?
Only by adding this hook into the driver has it become the bottleneck.
Prior to this, the bottleneck was later in the codepath, primarily in
allocations.

If a packet is to be dropped, and a determination can be made with fewer
cpu cycles spent, then there is more time for the goodput.

Beyond that, even if the skb allocation gets 10x or 100x or whatever
improvement, there is still a non-zero cost associated, and dropping bad
packets with minimal time spent has value. The same argument holds for
physical nic forwarding decisions.

> I am more curious than before to see the comparison for the same bpf code
> running at tc level vs in the driver..
Here is a perf report for drop in the clsact qdisc with direct-action,
which Daniel earlier showed to have the best performance to-date. On my
machine, this gets about 6.5Mpps drop single core. Drop due to failed
IP lookup (not shown here) is worse @4.5Mpps.

  9.24%  ksoftirqd/3  [mlx4_en]          [k] mlx4_en_process_rx_cq
  8.50%  ksoftirqd/3  [kernel.vmlinux]   [k] dev_gro_receive
  7.24%  ksoftirqd/3  [kernel.vmlinux]   [k] __netif_receive_skb_core
  5.47%  ksoftirqd/3  [mlx4_en]          [k] mlx4_en_complete_rx_desc
  4.74%  ksoftirqd/3  [kernel.vmlinux]   [k] kmem_cache_free
  3.94%  ksoftirqd/3  [mlx4_en]          [k] mlx4_en_alloc_frags
  3.42%  ksoftirqd/3  [kernel.vmlinux]   [k] napi_gro_frags
  3.34%  ksoftirqd/3  [kernel.vmlinux]   [k] inet_gro_receive
  3.32%  ksoftirqd/3  [kernel.vmlinux]   [k] __build_skb
  3.28%  ksoftirqd/3  [kernel.vmlinux]   [k] __napi_alloc_skb
  2.94%  ksoftirqd/3  [cls_bpf]          [k] cls_bpf_classify
  2.88%  ksoftirqd/3  [kernel.vmlinux]   [k] ktime_get_with_offset
  2.50%  ksoftirqd/3  [kernel.vmlinux]   [k] eth_type_trans
  2.40%  ksoftirqd/3  [kernel.vmlinux]   [k] kmem_cache_alloc
  2.29%  ksoftirqd/3  [kernel.vmlinux]   [k] skb_release_data
  2.25%  ksoftirqd/3  [kernel.vmlinux]   [k] gro_pull_from_frag0
  2.09%  ksoftirqd/3  [kernel.vmlinux]   [k] netif_receive_skb_internal
  1.99%  ksoftirqd/3  [kernel.vmlinux]   [k] memcpy_erms
  1.73%  ksoftirqd/3  [kernel.vmlinux]   [k] napi_get_frags
  1.66%  ksoftirqd/3  [kernel.vmlinux]   [k] __udp4_lib_lookup
  1.60%  ksoftirqd/3  [kernel.vmlinux]   [k] tc_classify
  1.25%  ksoftirqd/3  [kernel.vmlinux]   [k] kfree_skb
  1.24%  ksoftirqd/3  [kernel.vmlinux]   [k] get_page_from_freelist
  1.24%  ksoftirqd/3  [kernel.vmlinux]   [k] skb_gro_reset_offset
  1.16%  ksoftirqd/3  [kernel.vmlinux]   [k] udp4_gro_receive
  1.12%  ksoftirqd/3  [kernel.vmlinux]   [k] udp_gro_receive
  0.93%  ksoftirqd/3  [kernel.vmlinux]   [k] __free_page_frag
  0.91%  ksoftirqd/3  [kernel.vmlinux]   [k] skb_release_head_state
  0.89%  ksoftirqd/3  [kernel.vmlinux]   [k] __alloc_page_frag
  0.88%  ksoftirqd/3  [kernel.vmlinux]   [k] udp4_lib_lookup_skb
  0.83%  swapper      [kernel.vmlinux]   [k] intel_idle
  0.81%  ksoftirqd/3  [kernel.vmlinux]   [k] kfree_skbmem
  0.77%  ksoftirqd/3  [kernel.vmlinux]   [k] skb_release_all
  0.76%  ksoftirqd/3  [mlx4_en]          [k] mlx4_en_free_frag
  0.68%  ksoftirqd/3  [kernel.vmlinux]   [k] __netif_receive_skb
  0.64%  ksoftirqd/3  [kernel.vmlinux]   [k] free_pages_prepare
  0.53%  ksoftirqd/3  [kernel.vmlinux]   [k] read_tsc
  0.43%  ksoftirqd/3  [kernel.vmlinux]   [k] swiotlb_sync_single
  0.38%  ksoftirqd/3  [kernel.vmlinux]   [k] __memcpy
  0.37%  ksoftirqd/3  [kernel.vmlinux]   [k] bpf_map_lookup_elem
  0.35%  ksoftirqd/3  [kernel.vmlinux]   [k] __memcg_kmem_put_cache
  0.32%  ksoftirqd/3  [kernel.vmlinux]   [k] swiotlb_sync_single_for_cpu
  0.32%  ksoftirqd/3  [kernel.vmlinux]   [k] free_one_page
  0.25%  ksoftirqd/3  [kernel.vmlinux]   [k] __alloc_pages_nodemask
  0.23%  ksoftirqd/3  [kernel.vmlinux]   [k] net_rx_action
  0.22%  ksoftirqd/3  [kernel.vmlinux]   [k] __free_pages_ok
  0.21%  ksoftirqd/3  [mlx4_en]          [k] mlx4_alloc_pages.isra.23
  0.17%  ksoftirqd/3  [kernel.vmlinux]   [k] percpu_array_map_lookup_elem
  0.17%  ksoftirqd/3  [kernel.vmlinux]   [k] PageHuge
  0.15%  ksoftirqd/3  [wmi]              [k] 0x0000000000005d49
  0.15%  ksoftirqd/3  [kernel.vmlinux]   [k] __rmqueue
  0.13%  ksoftirqd/3  [wmi]              [k] 0x0000000000005d60

> 
> cheers,
> jamal

* Re: [PATCH net-next 3/6] qed/qede: Add VXLAN tunnel slowpath configuration support
From: Jesse Gross @ 2016-04-09 16:29 UTC (permalink / raw)
  To: Manish Chopra
  Cc: David Miller, Linux Kernel Network Developers, Ariel.Elior,
	Yuval.Mintz
In-Reply-To: <1460207825-3622-4-git-send-email-manish.chopra@qlogic.com>

On Sat, Apr 9, 2016 at 10:17 AM, Manish Chopra <manish.chopra@qlogic.com> wrote:
> diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c b/drivers/net/ethernet/qlogic/qede/qede_main.c
> index 518af32..9a82d42 100644
> --- a/drivers/net/ethernet/qlogic/qede/qede_main.c
> +++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
[...]
> +static netdev_features_t qede_features_check(struct sk_buff *skb,
> +                                            struct net_device *dev,
> +                                            netdev_features_t features)
> +{
> +       return vxlan_features_check(skb, features);
> +}

This is going to restrict the set of protocols that can be offloaded
to those that look exactly like VXLAN. In particular, it means that
you won't be able to offload Geneve with options. I don't think this is what
you mean to do given that you are supporting these protocols in the
other patches and I know this hardware has more capabilities than just
VXLAN. I think that you want to do a header length check - similar to
what the Intel drivers are doing for example.

I noticed that the bnx2x driver has a similar issue as well.

* Re: [PATCHv2 net-next 1/6] sctp: add sctp_info dump api for sctp_diag
From: Xin Long @ 2016-04-09 16:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, Vlad Yasevich,
	daniel, davem
In-Reply-To: <1460179179.6473.476.camel@edumazet-glaptop3.roam.corp.google.com>

On Sat, Apr 9, 2016 at 1:19 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2016-04-09 at 12:53 +0800, Xin Long wrote:
>> sctp_diag will dump some important details of sctp's assoc or ep, we use
>> sctp_info to describe them,  sctp_get_sctp_info to get them, and export
>> it to sctp_diag.ko.
>>
>
>
>> +int sctp_get_sctp_info(struct sock *sk, struct sctp_association *asoc,
>> +                    struct sctp_info *info)
>> +{
>> +     struct sctp_transport *prim;
>> +     struct list_head *pos, *temp;
>> +     int mask;
>> +
>> +     memset(info, 0, sizeof(*info));
>> +     if (!asoc) {
>> +             struct sctp_sock *sp = sctp_sk(sk);
>> +
>> +             info->sctpi_s_autoclose = sp->autoclose;
>> +             info->sctpi_s_adaptation_ind = sp->adaptation_ind;
>> +             info->sctpi_s_pd_point = sp->pd_point;
>> +             info->sctpi_s_nodelay = sp->nodelay;
>> +             info->sctpi_s_disable_fragments = sp->disable_fragments;
>> +             info->sctpi_s_v4mapped = sp->v4mapped;
>> +             info->sctpi_s_frag_interleave = sp->frag_interleave;
>> +
>> +             return 0;
>> +     }
>> +
>> +     info->sctpi_tag = asoc->c.my_vtag;
>> +     info->sctpi_state = asoc->state;
>> +     info->sctpi_rwnd = asoc->a_rwnd;
>> +     info->sctpi_unackdata = asoc->unack_data;
>> +     info->sctpi_penddata = sctp_tsnmap_pending(&asoc->peer.tsn_map);
>> +     info->sctpi_instrms = asoc->c.sinit_max_instreams;
>> +     info->sctpi_outstrms = asoc->c.sinit_num_ostreams;
>> +     list_for_each_safe(pos, temp, &asoc->base.inqueue.in_chunk_list)
>> +             info->sctpi_inqueue++;
>> +     list_for_each_safe(pos, temp, &asoc->outqueue.out_chunk_list)
>> +             info->sctpi_outqueue++;
>
> Is this safe ?
>
> Do you own the lock on socket or whatever lock protecting this list ?
>
There are 2 places that will call this code:
1. sctp_diag_dump -> sctp_for_each_transport -> sctp_tsp_dump
This one uses lock_sock to protect them. I think this one is ok.

2. sctp_diag_dump_one -> sctp_transport_lookup_process -> sctp_tsp_dump_one
This one just holds the tsp, and we're using list_for_each_safe here now.
Isn't that enough ?


>
>> +     info->sctpi_overall_error = asoc->overall_error_count;
>> +     info->sctpi_max_burst = asoc->max_burst;
>> +     info->sctpi_maxseg = asoc->frag_point;
>> +     info->sctpi_peer_rwnd = asoc->peer.rwnd;
>> +     info->sctpi_peer_tag = asoc->c.peer_vtag;
>> +
>> +     mask = asoc->peer.ecn_capable << 1;
>> +     mask = (mask | asoc->peer.ipv4_address) << 1;
>> +     mask = (mask | asoc->peer.ipv6_address) << 1;
>> +     mask = (mask | asoc->peer.hostname_address) << 1;
>> +     mask = (mask | asoc->peer.asconf_capable) << 1;
>> +     mask = (mask | asoc->peer.prsctp_capable) << 1;
>> +     mask = (mask | asoc->peer.auth_capable);
>> +     info->sctpi_peer_capable = mask;
>> +     mask = asoc->peer.sack_needed << 1;
>> +     mask = (mask | asoc->peer.sack_generation) << 1;
>> +     mask = (mask | asoc->peer.zero_window_announced);
>> +     info->sctpi_peer_sack = mask;
>> +
>> +     info->sctpi_isacks = asoc->stats.isacks;
>> +     info->sctpi_osacks = asoc->stats.osacks;
>> +     info->sctpi_opackets = asoc->stats.opackets;
>> +     info->sctpi_ipackets = asoc->stats.ipackets;
>> +     info->sctpi_rtxchunks = asoc->stats.rtxchunks;
>> +     info->sctpi_outofseqtsns = asoc->stats.outofseqtsns;
>> +     info->sctpi_idupchunks = asoc->stats.idupchunks;
>> +     info->sctpi_gapcnt = asoc->stats.gapcnt;
>> +     info->sctpi_ouodchunks = asoc->stats.ouodchunks;
>> +     info->sctpi_iuodchunks = asoc->stats.iuodchunks;
>> +     info->sctpi_oodchunks = asoc->stats.oodchunks;
>> +     info->sctpi_iodchunks = asoc->stats.iodchunks;
>> +     info->sctpi_octrlchunks = asoc->stats.octrlchunks;
>> +     info->sctpi_ictrlchunks = asoc->stats.ictrlchunks;
>> +
>> +     prim = asoc->peer.primary_path;
>> +     memcpy(&info->sctpi_p_address, &prim->ipaddr,
>> +            sizeof(struct sockaddr_storage));
>> +     info->sctpi_p_state = prim->state;
>> +     info->sctpi_p_cwnd = prim->cwnd;
>> +     info->sctpi_p_srtt = prim->srtt;
>> +     info->sctpi_p_rto = jiffies_to_msecs(prim->rto);
>> +     info->sctpi_p_hbinterval = prim->hbinterval;
>> +     info->sctpi_p_pathmaxrxt = prim->pathmaxrxt;
>> +     info->sctpi_p_sackdelay = jiffies_to_msecs(prim->sackdelay);
>> +     info->sctpi_p_ssthresh = prim->ssthresh;
>> +     info->sctpi_p_partial_bytes_acked = prim->partial_bytes_acked;
>> +     info->sctpi_p_flight_size = prim->flight_size;
>> +     info->sctpi_p_error = prim->error_count;
>> +
>> +     return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(sctp_get_sctp_info);
>
> info is not guaranteed to be aligned on 8 bytes.
>
> You need to use put_unaligned()
>
> Check commit ff5d749772018 ("tcp: beware of alignments in
> tcp_get_info()") for details.
Ok,  I will.
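
For context on the put_unaligned() remark, the portable userspace equivalent is a memcpy-based store and load, sketched below. The helper names are invented for this example; the kernel has its own put_unaligned()/get_unaligned() helpers and this is not their implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 64-bit value at an address that may not be 8-byte aligned. */
static void store_u64_unaligned(void *dst, uint64_t val)
{
	memcpy(dst, &val, sizeof(val));
}

/* Load a 64-bit value from an address that may not be 8-byte aligned. */
static uint64_t load_u64_unaligned(const void *src)
{
	uint64_t val;

	memcpy(&val, src, sizeof(val));
	return val;
}
```

A plain assignment through a misaligned uint64_t pointer can trap or be slow on some architectures, whereas a byte-wise copy is valid at any address, and compilers usually turn it into a single store where the hardware allows it.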

>
>
>
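
A minimal userspace sketch of the alignment concern raised above. The
kernel helper is put_unaligned(); store_u64_unaligned() and
load_u64_unaligned() below are hypothetical stand-ins built on memcpy(),
and the buffer layout is an assumption chosen only to force a misaligned
destination, as can happen when the structure is written into a payload
that is only 4-byte aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's put_unaligned():
 * copy the value byte-wise so the destination needs no alignment.
 */
static void store_u64_unaligned(void *dst, uint64_t val)
{
	memcpy(dst, &val, sizeof(val));
}

/* Matching read helper, again byte-wise. */
static uint64_t load_u64_unaligned(const void *src)
{
	uint64_t val;

	memcpy(&val, src, sizeof(val));
	return val;
}

/* Write a counter 4 bytes into an 8-byte aligned buffer, so the
 * destination is deliberately not 8-byte aligned, then read it back.
 */
static uint64_t roundtrip_at_offset4(uint64_t val)
{
	_Alignas(8) unsigned char buf[16] = { 0 };

	/* Confirm the destination really is misaligned for a u64. */
	assert(((uintptr_t)(buf + 4) % 8) == 4);

	store_u64_unaligned(buf + 4, val);
	return load_u64_unaligned(buf + 4);
}
```

In the patch this would presumably translate to one put_unaligned() call
per __u64 field, in the style of the tcp_get_info() commit cited above.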

^ permalink raw reply

* Re: [PATCH net-next 1/6] net: Make vxlan/geneve default udp ports public
From: Jesse Gross @ 2016-04-09 16:06 UTC (permalink / raw)
  To: Manish Chopra
  Cc: David Miller, Linux Kernel Network Developers, Ariel.Elior,
	Yuval.Mintz
In-Reply-To: <1460207825-3622-2-git-send-email-manish.chopra@qlogic.com>

On Sat, Apr 9, 2016 at 10:17 AM, Manish Chopra <manish.chopra@qlogic.com> wrote:
> Rationale behind this change is that with some OVS configuration
> UDP ports don't get notified to the driver using
> .ndo_[add|del]_vxlan_port. So for the driver to work with
> these specific ports in that environment we need to have them configured
> on adapter by default for the required hardware offload support.

I think you are referring to old out of tree code - no version of
upstream OVS does this. In addition, any old code won't work against
the new kernels that would include this driver update anyways so there
won't be a benefit in any case.

Please just use the normal registration mechanism that is already
exposed. I also noticed that in the Geneve case you aren't currently
registering for port notifications and just using the assigned port
number in all cases, which isn't right.

^ permalink raw reply

* Re: [RFC PATCH 07/11] GENEVE: Add option to mangle IP IDs on inner headers when using TSO
From: Jesse Gross @ 2016-04-09 15:52 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, Herbert Xu, Tom Herbert, Eric Dumazet,
	Linux Kernel Network Developers, David Miller
In-Reply-To: <CAKgT0UcehJaKhEQCATuyp48Q0gqofk7JjwsakNo1KgJd2YH-TA@mail.gmail.com>

On Fri, Apr 8, 2016 at 7:04 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Fri, Apr 8, 2016 at 2:40 PM, Jesse Gross <jesse@kernel.org> wrote:
>> Maybe I missed it but I didn't see any checks for the DF bit being set
>> when we transmit a packet with NETIF_F_TSO_MANGLEID. Even if I am
>> comfortable mangling my IDs in the DF case, I don't think this would
>> ever extend to non-DF packets. In the documentation you noted that it
>> is the driver's responsibility to do this check but I couldn't find it
>> in either ixgbe or igb. It would also be nice if the core stack could
>> enforce it somehow as well rather than each driver.
>
> Yeah I had glossed over that in the igb and ixgbe patches.  A check is
> only really needed for the incrementing to non-incrementing case and I
> wasn't sure how common it was to have TCP with an IP header that
> didn't set the DF bit.  In the case of the outer headers igb and ixgbe
> will increment the IP ID always so we don't have to worry about if DF
> is set or not there.  For the inner headers I had fudged it a bit and
> didn't add the validation.  If needed I can see about adding that
> shortly.

TCP without the DF bit set is not the default but it is possible (it
can be enabled by setting /proc/sys/net/ipv4/ip_no_pmtu_disc). I also
did a quick check of some Internet services and at least some of them
seem to return TCP without DF, so it's not too rare.
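
The check being asked for amounts to testing the DF flag before allowing
a fixed IP ID. A minimal sketch on a raw IPv4 header follows; the helper
names are hypothetical and are not taken from igb, ixgbe, or the core
stack.

```c
#include <assert.h>
#include <stdint.h>

/* In an IPv4 header, bytes 6-7 hold the flags and fragment offset;
 * DF (Don't Fragment) is bit 0x40 of byte 6 (0x4000 in the 16-bit field).
 */
static int ipv4_df_is_set(const uint8_t *iph)
{
	return (iph[6] & 0x40) != 0;
}

/* Hypothetical policy helper: a non-incrementing (mangled) IP ID is only
 * tolerable when the packet cannot be fragmented, i.e. when DF is set.
 */
static int ip_id_mangling_allowed(const uint8_t *iph)
{
	assert((iph[0] >> 4) == 4);	/* this sketch handles IPv4 only */
	return ipv4_df_is_set(iph);
}
```

A packet that fails this test would need its segments to carry
incrementing IDs, since the ID is what a receiver uses for reassembly.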

^ permalink raw reply

* Re: [PATCHv2 net-next 1/6] sctp: add sctp_info dump api for sctp_diag
From: Xin Long @ 2016-04-09 15:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, Vlad Yasevich,
	daniel, davem
In-Reply-To: <1460178984.6473.473.camel@edumazet-glaptop3.roam.corp.google.com>

On Sat, Apr 9, 2016 at 1:16 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2016-04-09 at 12:53 +0800, Xin Long wrote:
>> sctp_diag will dump some important details of sctp's assoc or ep, we use
>> sctp_info to describe them,  sctp_get_sctp_info to get them, and export
>> it to sctp_diag.ko.
>>
>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
>> ---
>>  include/linux/sctp.h    | 65 +++++++++++++++++++++++++++++++++++++
>>  include/net/sctp/sctp.h |  3 ++
>>  net/sctp/socket.c       | 86 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 154 insertions(+)
>>
>> diff --git a/include/linux/sctp.h b/include/linux/sctp.h
>> index a9414fd..a448ebc 100644
>> --- a/include/linux/sctp.h
>> +++ b/include/linux/sctp.h
>> @@ -705,4 +705,69 @@ typedef struct sctp_auth_chunk {
>>       sctp_authhdr_t auth_hdr;
>>  } __packed sctp_auth_chunk_t;
>>
>> +struct sctp_info {
>> +     __u32   sctpi_tag;
>> +     __u32   sctpi_state;
>> +     __u32   sctpi_rwnd;
>> +     __u16   sctpi_unackdata;
>> +     __u16   sctpi_penddata;
>> +     __u16   sctpi_instrms;
>> +     __u16   sctpi_outstrms;
>> +     __u32   sctpi_fragmentation_point;
>> +     __u32   sctpi_inqueue;
>> +     __u32   sctpi_outqueue;
>> +     __u32   sctpi_overall_error;
>> +     __u32   sctpi_max_burst;
>> +     __u32   sctpi_maxseg;
>> +     __u32   sctpi_peer_rwnd;
>> +     __u32   sctpi_peer_tag;
>> +     __u8    sctpi_peer_capable;
>> +     __u8    sctpi_peer_sack;
>> +
>> +     /* assoc status info */
>> +     __u64   sctpi_isacks;
>> +     __u64   sctpi_osacks;
>> +     __u64   sctpi_opackets;
>> +     __u64   sctpi_ipackets;
>> +     __u64   sctpi_rtxchunks;
>> +     __u64   sctpi_outofseqtsns;
>> +     __u64   sctpi_idupchunks;
>> +     __u64   sctpi_gapcnt;
>> +     __u64   sctpi_ouodchunks;
>> +     __u64   sctpi_iuodchunks;
>> +     __u64   sctpi_oodchunks;
>> +     __u64   sctpi_iodchunks;
>> +     __u64   sctpi_octrlchunks;
>> +     __u64   sctpi_ictrlchunks;
>> +
>> +     /* primary transport info */
>> +     struct sockaddr_storage sctpi_p_address;
>> +     __s32   sctpi_p_state;
>> +     __u32   sctpi_p_cwnd;
>> +     __u32   sctpi_p_srtt;
>> +     __u32   sctpi_p_rto;
>> +     __u32   sctpi_p_hbinterval;
>> +     __u32   sctpi_p_pathmaxrxt;
>> +     __u32   sctpi_p_sackdelay;
>> +     __u32   sctpi_p_sackfreq;
>> +     __u32   sctpi_p_ssthresh;
>> +     __u32   sctpi_p_partial_bytes_acked;
>> +     __u32   sctpi_p_flight_size;
>> +     __u16   sctpi_p_error;
>> +
>> +     /* sctp sock info */
>> +     __u32   sctpi_s_autoclose;
>> +     __u32   sctpi_s_adaptation_ind;
>> +     __u32   sctpi_s_pd_point;
>> +     __u8    sctpi_s_nodelay;
>> +     __u8    sctpi_s_disable_fragments;
>> +     __u8    sctpi_s_v4mapped;
>> +     __u8    sctpi_s_frag_interleave;
>> +};
>> +
>
> Lots of holes in this structure...
>
>
Will check and improve it later.
Thanks.
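
One way to see the holes is to compare member offsets with offsetof().
The sketch below uses a shortened, hypothetical excerpt of the structure
(the two __u8 members in front of the first __u64), not the full
definition, and assumes an ABI where a 64-bit integer is aligned to more
than 2 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical excerpt: two 8-bit members followed by a 64-bit one, as at
 * the boundary between sctpi_peer_sack and sctpi_isacks.
 */
struct excerpt_with_hole {
	uint8_t  peer_capable;
	uint8_t  peer_sack;
	uint64_t isacks;
};

/* Same members with explicit padding so the 64-bit member starts at 8. */
struct excerpt_padded {
	uint8_t  peer_capable;
	uint8_t  peer_sack;
	uint8_t  pad[6];
	uint64_t isacks;
};

static_assert(sizeof(struct excerpt_padded) == 16,
	      "explicitly padded layout should have no hidden bytes");

/* Bytes of compiler-inserted padding between member a and the member b
 * that directly follows it.
 */
#define GAP_AFTER(type, a, b) \
	(offsetof(type, b) - (offsetof(type, a) + sizeof(((type *)0)->a)))

static size_t hole_in_original(void)
{
	return GAP_AFTER(struct excerpt_with_hole, peer_sack, isacks);
}

static size_t hole_in_padded(void)
{
	return GAP_AFTER(struct excerpt_padded, pad, isacks);
}
```

The pahole tool from the dwarves package reports the same information
for every member of a compiled structure.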

^ permalink raw reply

* Re: [next-queue PATCH 0/3] Add support for GSO partial to Intel NIC drivers
From: Alexander Duyck @ 2016-04-09 15:41 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: Alexander Duyck, Herbert Xu, Tom Herbert, Jesse Gross,
	Eric Dumazet, intel-wired-lan, Netdev, David Miller
In-Reply-To: <1460185149.2982.6.camel@intel.com>

On Fri, Apr 8, 2016 at 11:59 PM, Jeff Kirsher
<jeffrey.t.kirsher@intel.com> wrote:
> On Fri, 2016-04-08 at 17:06 -0400, Alexander Duyck wrote:
>> So these are the patches needed to enable tunnel segmentation
>> offloads on
>> the igb, igbvf, ixgbe, and ixgbevf drivers.  In addition this patch
>> extends
>> the i40e and i40evf drivers to include segmentation support for
>> tunnels
>> with outer checksums.
>>
>> The net performance gain for these patches are pretty significant.
>> In the
>> case of i40e a tunnel with outer checksums showed the following
>> improvement:
>> Throughput Throughput  Local Local   Result
>>            Units       CPU   Service Tag
>>                        Util  Demand
>>                        %
>> 14066.29   10^6bits/s  3.49  0.651   "before"
>> 20618.16   10^6bits/s  3.09  0.393   "after"
>>
>> For ixgbe similar results were seen:
>> Throughput Throughput  Local  Local   Result
>>            Units       CPU    Service Tag
>>                        Util   Demand
>>                        %
>> 12879.89   10^6bits/s  10.00  0.763   "before"
>> 14286.77   10^6bits/s  5.74   0.395   "after"
>>
>> These patches all rely on the TSO_MANGLEID and GSO_PARTIAL patches so
>> I
>> would not recommend applying them until those patches have first been
>> applied.
>
> Sorry I did not see this until after I tried applying your series. :-(
>
> Maybe the two dependent patches should have been in the series, so I
> and others do not waste their time.  Or not send this until the two
> patches were accepted.

Sorry I meant to send these as an RFC but sent it out with the
next-queue tag as I had gotten a bit distracted.

I shouldn't need to resubmit these until the other patches are
accepted so I will probably follow that route.

Thanks.

- Alex

^ permalink raw reply

* Re: [PATCHv2 net-next 4/6] sctp: add the sctp_diag.c file
From: Xin Long @ 2016-04-09 15:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, Vlad Yasevich,
	daniel, davem
In-Reply-To: <1460181116.6473.483.camel@edumazet-glaptop3.roam.corp.google.com>

On Sat, Apr 9, 2016 at 1:51 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Sat, 2016-04-09 at 12:53 +0800, Xin Long wrote:
>> This one will implement all the interface of inet_diag, inet_diag_handler.
>> which includes sctp_diag_dump, sctp_diag_dump_one and sctp_diag_get_info.
>
>
>> +static int inet_assoc_diag_fill(struct sock *sk,
>> +                             struct sctp_association *asoc,
>> +                             struct sk_buff *skb,
>> +                             const struct inet_diag_req_v2 *req,
>> +                             struct user_namespace *user_ns,
>> +                             int portid, u32 seq, u16 nlmsg_flags,
>> +                             const struct nlmsghdr *unlh)
>> +{
>> +     const struct inet_sock *inet = inet_sk(sk);
>> +     const struct inet_diag_handler *handler;
>> +     int ext = req->idiag_ext;
>> +     struct inet_diag_msg *r;
>> +     struct nlmsghdr  *nlh;
>> +     struct nlattr *attr;
>> +     void *info = NULL;
>> +     union sctp_addr laddr, paddr;
>> +     struct dst_entry *dst;
>> +     struct sctp_infox infox;
>> +
>> +     handler = inet_diag_get_handler(req->sdiag_protocol);
>> +     BUG_ON(!handler);
>> +
>> +     nlh = nlmsg_put(skb, portid, seq, unlh->nlmsg_type, sizeof(*r),
>> +                     nlmsg_flags);
>> +     if (!nlh)
>> +             return -EMSGSIZE;
>> +
>> +     r = nlmsg_data(nlh);
>> +     BUG_ON(!sk_fullsock(sk));
>> +
>> +     laddr = list_entry(asoc->base.bind_addr.address_list.next,
>> +                        struct sctp_sockaddr_entry, list)->a;
>> +     paddr = asoc->peer.primary_path->ipaddr;
>> +     dst = asoc->peer.primary_path->dst;
>> +
>> +     r->idiag_family = sk->sk_family;
>> +     r->id.idiag_sport = htons(asoc->base.bind_addr.port);
>> +     r->id.idiag_dport = htons(asoc->peer.port);
>> +     r->id.idiag_if = dst ? dst->dev->ifindex : 0;
>> +     sock_diag_save_cookie(sk, r->id.idiag_cookie);
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +     if (sk->sk_family == AF_INET6) {
>> +             *(struct in6_addr *)r->id.idiag_src = laddr.v6.sin6_addr;
>> +             *(struct in6_addr *)r->id.idiag_dst = paddr.v6.sin6_addr;
>> +     } else
>> +#endif
>> +     {
>> +             memset(&r->id.idiag_src, 0, sizeof(r->id.idiag_src));
>> +             memset(&r->id.idiag_dst, 0, sizeof(r->id.idiag_dst));
>> +
>> +             r->id.idiag_src[0] = laddr.v4.sin_addr.s_addr;
>> +             r->id.idiag_dst[0] = paddr.v4.sin_addr.s_addr;
>> +     }
>> +
>> +     r->idiag_state = asoc->state;
>> +     r->idiag_timer = SCTP_EVENT_TIMEOUT_T3_RTX;
>> +     r->idiag_retrans = asoc->rtx_data_chunks;
>> +#define EXPIRES_IN_MS(tmo)  DIV_ROUND_UP((tmo - jiffies) * 1000, HZ)
>> +     r->idiag_expires =
>> +             EXPIRES_IN_MS(asoc->timeouts[SCTP_EVENT_TIMEOUT_T3_RTX]);
>> +#undef EXPIRES_IN_MS
>> +
>> +     if (nla_put_u8(skb, INET_DIAG_SHUTDOWN, sk->sk_shutdown))
>> +             goto errout;
>> +
>> +     /* IPv6 dual-stack sockets use inet->tos for IPv4 connections,
>> +      * hence this needs to be included regardless of socket family.
>> +      */
>> +     if (ext & (1 << (INET_DIAG_TOS - 1)))
>> +             if (nla_put_u8(skb, INET_DIAG_TOS, inet->tos) < 0)
>> +                     goto errout;
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +     if (r->idiag_family == AF_INET6) {
>> +             if (ext & (1 << (INET_DIAG_TCLASS - 1)))
>> +                     if (nla_put_u8(skb, INET_DIAG_TCLASS,
>> +                                    inet6_sk(sk)->tclass) < 0)
>> +                             goto errout;
>> +
>> +             if (((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE)) &&
>> +                 nla_put_u8(skb, INET_DIAG_SKV6ONLY, ipv6_only_sock(sk)))
>> +                     goto errout;
>> +     }
>> +#endif
>> +
>> +     r->idiag_uid = from_kuid_munged(user_ns, sock_i_uid(sk));
>> +     r->idiag_inode = sock_i_ino(sk);
>> +
>> +     if (ext & (1 << (INET_DIAG_MEMINFO - 1))) {
>> +             struct inet_diag_meminfo minfo = {
>> +                     .idiag_rmem = sk_rmem_alloc_get(sk),
>> +                     .idiag_wmem = sk->sk_wmem_queued,
>> +                     .idiag_fmem = sk->sk_forward_alloc,
>> +                     .idiag_tmem = sk_wmem_alloc_get(sk),
>> +             };
>> +
>
> All this code looks familiar.
>
> Why inet_sk_diag_fill() is not used instead ?
>
It's hard to reuse inet_sk_diag_fill() here, because some of the values
come from the assoc rather than from the sock.

Yes, there is some duplicated code. To avoid it we would have to
extract a new function for this part, which would mean more changes
in inet_diag.


>> +             if (nla_put(skb, INET_DIAG_MEMINFO, sizeof(minfo), &minfo) < 0)
>> +                     goto errout;
>> +     }
>> +
>> +     if (ext & (1 << (INET_DIAG_SKMEMINFO - 1)))
>> +             if (sock_diag_put_meminfo(sk, skb, INET_DIAG_SKMEMINFO))
>> +                     goto errout;
>> +
>> +     if ((ext & (1 << (INET_DIAG_INFO - 1))) && handler->idiag_info_size) {
>> +             attr = nla_reserve(skb, INET_DIAG_INFO,
>> +                                handler->idiag_info_size);
>> +             if (!attr)
>> +                     goto errout;
>> +
>> +             info = nla_data(attr);
>> +     }
>> +     infox.sctpinfo = (struct sctp_info *)info;
>> +     infox.asoc = asoc;
>> +     handler->idiag_get_info(sk, r, &infox);
>> +
>> +     if (ext & (1 << (INET_DIAG_CONG - 1)))
>> +             if (nla_put_string(skb, INET_DIAG_CONG, "reno") < 0)
>> +                     goto errout;
>> +
>> +     if (inet_sctp_fill_laddrs(skb, &asoc->base.bind_addr.address_list))
>> +             goto errout;
>> +
>> +     if (inet_sctp_fill_paddrs(skb, asoc))
>> +             goto errout;
>> +
>> +     nlmsg_end(skb, nlh);
>> +     return 0;
>> +
>> +errout:
>> +     nlmsg_cancel(skb, nlh);
>> +     return -EMSGSIZE;
>> +}
>> +
>> +static int inet_ep_diag_fill(struct sock *sk, struct sctp_endpoint *ep,
>> +                          struct sk_buff *skb,
>> +                          const struct inet_diag_req_v2 *req,
>> +                          struct user_namespace *user_ns,
>> +                          u32 portid, u32 seq, u16 nlmsg_flags,
>> +                          const struct nlmsghdr *unlh)
>> +{
>> +     const struct inet_sock *inet = inet_sk(sk);
>> +     const struct inet_diag_handler *handler;
>> +     int ext = req->idiag_ext;
>> +     struct inet_diag_msg *r;
>> +     struct nlmsghdr  *nlh;
>> +     struct nlattr *attr;
>> +     void *info = NULL;
>> +     struct sctp_infox infox;
>> +
>> +     handler = inet_diag_get_handler(req->sdiag_protocol);
>> +     BUG_ON(!handler);
>> +
>> +     nlh = nlmsg_put(skb, portid, seq, unlh->nlmsg_type, sizeof(*r),
>> +                     nlmsg_flags);
>> +     if (!nlh)
>> +             return -EMSGSIZE;
>> +
>> +     r = nlmsg_data(nlh);
>> +     BUG_ON(!sk_fullsock(sk));
>> +
>> +     inet_diag_msg_common_fill(r, sk);
>> +     r->idiag_state = sk->sk_state;
>> +     r->idiag_timer = 0;
>> +     r->idiag_retrans = 0;
>> +
>> +     if (nla_put_u8(skb, INET_DIAG_SHUTDOWN, sk->sk_shutdown))
>> +             goto errout;
>> +
>> +     /* IPv6 dual-stack sockets use inet->tos for IPv4 connections,
>> +      * hence this needs to be included regardless of socket family.
>> +      */
>> +     if (ext & (1 << (INET_DIAG_TOS - 1)))
>> +             if (nla_put_u8(skb, INET_DIAG_TOS, inet->tos) < 0)
>> +                     goto errout;
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +     if (r->idiag_family == AF_INET6) {
>> +             if (ext & (1 << (INET_DIAG_TCLASS - 1)))
>> +                     if (nla_put_u8(skb, INET_DIAG_TCLASS,
>> +                                    inet6_sk(sk)->tclass) < 0)
>> +                             goto errout;
>> +
>> +             if (((1 << sk->sk_state) & (TCPF_LISTEN | TCPF_CLOSE)) &&
>> +                 nla_put_u8(skb, INET_DIAG_SKV6ONLY, ipv6_only_sock(sk)))
>> +                     goto errout;
>> +     }
>> +#endif
>> +
>> +     r->idiag_uid = from_kuid_munged(user_ns, sock_i_uid(sk));
>> +     r->idiag_inode = sock_i_ino(sk);
>> +
>> +     if (ext & (1 << (INET_DIAG_MEMINFO - 1))) {
>> +             struct inet_diag_meminfo minfo = {
>> +                     .idiag_rmem = sk_rmem_alloc_get(sk),
>> +                     .idiag_wmem = sk->sk_wmem_queued,
>> +                     .idiag_fmem = sk->sk_forward_alloc,
>> +                     .idiag_tmem = sk_wmem_alloc_get(sk),
>> +             };
>> +
>
> Again, looks a lot of duplication.
>
> Also you missed that INET_DIAG_MEMINFO is kind of obsolete,
> now we have sock_diag_put_meminfo()
>
You mean we can remove it here?
Yes, it seems similar to INET_DIAG_SKMEMINFO,
but I do not know whether userspace may still use INET_DIAG_MEMINFO.

>
>> +             if (nla_put(skb, INET_DIAG_MEMINFO, sizeof(minfo), &minfo) < 0)
>> +                     goto errout;
>> +     }
>> +
>> +     if (ext & (1 << (INET_DIAG_SKMEMINFO - 1)))
>> +             if (sock_diag_put_meminfo(sk, skb, INET_DIAG_SKMEMINFO))
>> +                     goto errout;
>> +
>> +     if ((ext & (1 << (INET_DIAG_INFO - 1))) && handler->idiag_info_size) {
>> +             attr = nla_reserve(skb, INET_DIAG_INFO,
>> +                                handler->idiag_info_size);
>> +             if (!attr)
>> +                     goto errout;
>> +
>> +             info = nla_data(attr);
>> +     }
>> +     infox.sctpinfo = (struct sctp_info *)info;
>> +     infox.asoc = NULL;
>> +     handler->idiag_get_info(sk, r, &infox);
>> +
>> +     if (inet_sctp_fill_laddrs(skb, &ep->base.bind_addr.address_list))
>> +             goto errout;
>> +
>> +     nlmsg_end(skb, nlh);
>> +     return 0;
>> +
>> +errout:
>> +     nlmsg_cancel(skb, nlh);
>> +     return -EMSGSIZE;
>> +}
>> +
>> +static size_t inet_assoc_attr_size(struct sctp_association *asoc)
>> +{
>> +     int addrlen = sizeof(struct sockaddr_storage);
>> +     int addrcnt = 0;
>> +     struct sctp_sockaddr_entry *laddr;
>> +
>> +     list_for_each_entry_rcu(laddr, &asoc->base.bind_addr.address_list,
>> +                             list)
>> +             addrcnt++;
>> +
>> +     return    nla_total_size(sizeof(struct tcp_info))
>
> Are you sure you want to use tcp_info ???
That is a mistake, it should be:
    nla_total_size(sizeof(struct sctp_info))

thanks.
>
>> +             + nla_total_size(1) /* INET_DIAG_SHUTDOWN */
>> +             + nla_total_size(1) /* INET_DIAG_TOS */
>> +             + nla_total_size(1) /* INET_DIAG_TCLASS */
>> +             + nla_total_size(addrlen * asoc->peer.transport_count)
>> +             + nla_total_size(addrlen * addrcnt)
>> +             + nla_total_size(sizeof(struct inet_diag_meminfo))
>> +             + nla_total_size(sizeof(struct inet_diag_msg))
>> +             + nla_total_size(sizeof(struct sctp_info))
And I will remove this one.

>> +             + 64;
>> +}
>
>
>
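
For reference, the arithmetic behind nla_total_size() can be sketched in
userspace. The names below are local stand-ins (suffixed _SKETCH or
_sketch), assuming the usual netlink attribute layout of a 4-byte header
and 4-byte alignment.

```c
#include <assert.h>
#include <stddef.h>

#define NLA_ALIGNTO_SKETCH	((size_t)4)
#define NLA_ALIGN_SKETCH(len) \
	(((len) + NLA_ALIGNTO_SKETCH - 1) & ~(NLA_ALIGNTO_SKETCH - 1))
/* struct nlattr is two 16-bit fields (nla_len, nla_type): 4 bytes. */
#define NLA_HDRLEN_SKETCH	NLA_ALIGN_SKETCH((size_t)4)

static_assert(NLA_HDRLEN_SKETCH == 4, "attribute header is 4 bytes");

/* Header plus payload, rounded up to the attribute alignment. */
static size_t nla_total_size_sketch(size_t payload)
{
	return NLA_ALIGN_SKETCH(NLA_HDRLEN_SKETCH + payload);
}

/* Space for n one-byte attributes such as INET_DIAG_SHUTDOWN,
 * INET_DIAG_TOS and INET_DIAG_TCLASS.
 */
static size_t one_byte_attrs_size(size_t n)
{
	return n * nla_total_size_sketch(1);
}
```

Under these assumptions each nla_total_size(1) term reserves 8 bytes, so
the three one-byte attributes account for 24 bytes of the estimate.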

^ permalink raw reply

* Re: [RFC PATCH v2 1/5] bpf: add PHYS_DEV prog type for early driver filter
From: Jamal Hadi Salim @ 2016-04-09 15:29 UTC (permalink / raw)
  To: Tom Herbert, Alexei Starovoitov
  Cc: Jesper Dangaard Brouer, Brenden Blanco, David S. Miller,
	Linux Kernel Network Developers, Or Gerlitz, Daniel Borkmann,
	Eric Dumazet, Edward Cree, john fastabend, Thomas Graf,
	Johannes Berg, eranlinuxmellanox, Lorenzo Colitti, linux-mm
In-Reply-To: <CALx6S36d74D-8Rx762nmNwb1TF0M0sBfojBhUF96prJiYmDYiQ@mail.gmail.com>

On 16-04-09 07:29 AM, Tom Herbert wrote:

> +1. Forwarding which will be a common application almost always
> requires modification (decrement TTL), and header data split has
> always been a weak feature since the device has to have some arbitrary
> rules about what headers needs to be split out (either implements
> protocol specific parsing or some fixed length).

Then this is sensible. I was cruising the threads and
confused by your earlier emails Tom because you talked
about XPS etc. It sounded like the idea evolved into putting
the whole freaking stack on bpf.
If this is _forwarding only_ it may be useful to look at
Alexey's old code in particular the DMA bits;
he built his own lookup algorithm but sounds like bpf is
a much better fit today.

cheers,
jamal


^ permalink raw reply

* Re: [PATCHv2 net-next 1/6] sctp: add sctp_info dump api for sctp_diag
From: Jamal Hadi Salim @ 2016-04-09 15:19 UTC (permalink / raw)
  To: Eric Dumazet, Xin Long
  Cc: network dev, linux-sctp, Marcelo Ricardo Leitner, Vlad Yasevich,
	daniel, davem
In-Reply-To: <1460178984.6473.473.camel@edumazet-glaptop3.roam.corp.google.com>

On 16-04-09 01:16 AM, Eric Dumazet wrote:

>
> Lots of holes in this structure...
>
>

I may have mentioned to you that there is an 8-bit hole in tcp_info too ;->
(above tcpi_rto). Adding an explicit 8-bit pad seems useful
since it is already in the wild. I was going to send the patch after
netdev11 but forgot.

cheers,
jamal

^ permalink raw reply

* Re: [PATCHv2 net-next 1/6] sctp: add sctp_info dump api for sctp_diag
From: Jamal Hadi Salim @ 2016-04-09 15:16 UTC (permalink / raw)
  To: Xin Long, network dev, linux-sctp
  Cc: Marcelo Ricardo Leitner, Vlad Yasevich, daniel, davem
In-Reply-To: <c507274a984bd1b0a7e7a59d1e825352536efd25.1460177331.git.lucien.xin@gmail.com>

Appreciate these patches. Finally some love for sctp.
Small comment below:

On 16-04-09 12:53 AM, Xin Long wrote:
> sctp_diag will dump some important details of sctp's assoc or ep, we use
> sctp_info to describe them,  sctp_get_sctp_info to get them, and export
> it to sctp_diag.ko.
>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>   include/linux/sctp.h    | 65 +++++++++++++++++++++++++++++++++++++
>   include/net/sctp/sctp.h |  3 ++
>   net/sctp/socket.c       | 86 +++++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 154 insertions(+)
>
> diff --git a/include/linux/sctp.h b/include/linux/sctp.h
> index a9414fd..a448ebc 100644
> --- a/include/linux/sctp.h
> +++ b/include/linux/sctp.h
> @@ -705,4 +705,69 @@ typedef struct sctp_auth_chunk {
>   	sctp_authhdr_t auth_hdr;
>   } __packed sctp_auth_chunk_t;
>
> +struct sctp_info {
> +	__u32	sctpi_tag;
> +	__u32	sctpi_state;
> +	__u32	sctpi_rwnd;
> +	__u16	sctpi_unackdata;
> +	__u16	sctpi_penddata;
> +	__u16	sctpi_instrms;
> +	__u16	sctpi_outstrms;
> +	__u32	sctpi_fragmentation_point;
> +	__u32	sctpi_inqueue;
> +	__u32	sctpi_outqueue;
> +	__u32	sctpi_overall_error;
> +	__u32	sctpi_max_burst;
> +	__u32	sctpi_maxseg;
> +	__u32	sctpi_peer_rwnd;
> +	__u32	sctpi_peer_tag;
> +	__u8	sctpi_peer_capable;
> +	__u8	sctpi_peer_sack;
> +
> +	/* assoc status info */
> +	__u64	sctpi_isacks;
> +	__u64	sctpi_osacks;
> +	__u64	sctpi_opackets;
> +	__u64	sctpi_ipackets;
> +	__u64	sctpi_rtxchunks;
> +	__u64	sctpi_outofseqtsns;
> +	__u64	sctpi_idupchunks;
> +	__u64	sctpi_gapcnt;
> +	__u64	sctpi_ouodchunks;
> +	__u64	sctpi_iuodchunks;
> +	__u64	sctpi_oodchunks;
> +	__u64	sctpi_iodchunks;
> +	__u64	sctpi_octrlchunks;
> +	__u64	sctpi_ictrlchunks;
> +
> +	/* primary transport info */
> +	struct sockaddr_storage	sctpi_p_address;
> +	__s32	sctpi_p_state;
> +	__u32	sctpi_p_cwnd;
> +	__u32	sctpi_p_srtt;
> +	__u32	sctpi_p_rto;
> +	__u32	sctpi_p_hbinterval;
> +	__u32	sctpi_p_pathmaxrxt;
> +	__u32	sctpi_p_sackdelay;
> +	__u32	sctpi_p_sackfreq;
> +	__u32	sctpi_p_ssthresh;
> +	__u32	sctpi_p_partial_bytes_acked;
> +	__u32	sctpi_p_flight_size;
> +	__u16	sctpi_p_error;
> +
> +	/* sctp sock info */
> +	__u32	sctpi_s_autoclose;
> +	__u32	sctpi_s_adaptation_ind;
> +	__u32	sctpi_s_pd_point;
> +	__u8	sctpi_s_nodelay;
> +	__u8	sctpi_s_disable_fragments;
> +	__u8	sctpi_s_v4mapped;
> +	__u8	sctpi_s_frag_interleave;
> +};
> +


Can you double-check that this is 32-bit aligned with no holes
(or, in your case, maybe 64-bit aligned)?
Sticking __u16 sctpi_p_error in there seems
to break it.

Also, any plans to do the netlink events and destroy features?

cheers,
jamal

^ permalink raw reply

* [PATCH net-next] ipv6: fix inet6_lookup_listener()
From: Eric Dumazet @ 2016-04-09 15:01 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Maciej Żenczykowski

From: Eric Dumazet <edumazet@google.com>

A stupid refactoring bug in inet6_lookup_listener() needs to be fixed
in order to get proper SO_REUSEPORT behavior.

Fixes: 3b24d854cb35 ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
---
 net/ipv6/inet6_hashtables.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 607da088344d..f1678388fb0d 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -137,7 +137,7 @@ struct sock *inet6_lookup_listener(struct net *net,
 	sk_for_each(sk, &ilb->head) {
 		score = compute_score(sk, net, hnum, daddr, dif);
 		if (score > hiscore) {
-			hiscore = score;
+			reuseport = sk->sk_reuseport;
 			if (reuseport) {
 				phash = inet6_ehashfn(net, daddr, hnum,
 						      saddr, sport);
@@ -148,7 +148,7 @@ struct sock *inet6_lookup_listener(struct net *net,
 				matches = 1;
 			}
 			result = sk;
-			reuseport = sk->sk_reuseport;
+			hiscore = score;
 		} else if (score == hiscore && reuseport) {
 			matches++;
 			if (reciprocal_scale(phash, matches) == 0)
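
The ordering problem fixed above can be illustrated with a userspace
sketch of the selection loop. struct listener, pick_listener() and the
*_sketch helpers are hypothetical; the scaling and pseudo-random steps
are assumptions modeled on the kernel helpers with similar names. The
point is that the reuseport flag of the new best socket has to be read
before it is tested, and hiscore updated afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a listening socket and its computed score. */
struct listener {
	int score;
	int reuseport;
};

/* Map a 32-bit value into [0, ep_ro) without a division. */
static uint32_t reciprocal_scale_sketch(uint32_t val, uint32_t ep_ro)
{
	return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
}

/* Simple linear congruential step, assumed here for re-hashing. */
static uint32_t next_pseudo_random_sketch(uint32_t seed)
{
	return seed * 1664525u + 1013904223u;
}

/* Return the index of the chosen listener, or -1 if none scores above 0. */
static int pick_listener(const struct listener *l, size_t n, uint32_t flow_hash)
{
	int result = -1, hiscore = 0, matches = 0, reuseport = 0;
	uint32_t phash = 0;
	size_t i;

	assert(l != NULL || n == 0);

	for (i = 0; i < n; i++) {
		int score = l[i].score;

		if (score > hiscore) {
			/* Read the flag of the new best entry first ... */
			reuseport = l[i].reuseport;
			if (reuseport) {
				phash = flow_hash;
				matches = 1;
			}
			result = (int)i;
			/* ... and record the new high score last. */
			hiscore = score;
		} else if (score == hiscore && reuseport) {
			/* Equal score: pick among the group by flow hash. */
			matches++;
			if (reciprocal_scale_sketch(phash, matches) == 0)
				result = (int)i;
			phash = next_pseudo_random_sketch(phash);
		}
	}
	return result;
}
```

With the pre-fix ordering, the reuseport test would see the flag of the
previous best entry, so a reuseport group that follows a lower-scoring
non-reuseport listener would start with matches and phash uninitialized.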

^ permalink raw reply related

* Re: [RFC PATCH v2 5/5] Add sample for adding simple drop program to link
From: Jamal Hadi Salim @ 2016-04-09 14:48 UTC (permalink / raw)
  To: Brenden Blanco, davem
  Cc: netdev, tom, alexei.starovoitov, ogerlitz, daniel, brouer,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo
In-Reply-To: <1460090930-11219-5-git-send-email-bblanco@plumgrid.com>

On 16-04-08 12:48 AM, Brenden Blanco wrote:
> Add a sample program that only drops packets at the
> BPF_PROG_TYPE_PHYS_DEV hook of a link. With the drop-only program,
> observed single core rate is ~19.5Mpps.
>
> Other tests were run, for instance without the dropcnt increment or
> without reading from the packet header, the packet rate was mostly
> unchanged.
>
> $ perf record -a samples/bpf/netdrvx1 $(</sys/class/net/eth0/ifindex)
> proto 17:   19596362 drops/s
>
> ./pktgen_sample03_burst_single_flow.sh -i $DEV -d $IP -m $MAC -t 4
> Running... ctrl^C to stop
> Device: eth4@0
> Result: OK: 7873817(c7872245+d1572) usec, 38801823 (60byte,0frags)
>    4927955pps 2365Mb/sec (2365418400bps) errors: 0
> Device: eth4@1
> Result: OK: 7873817(c7872123+d1693) usec, 38587342 (60byte,0frags)
>    4900715pps 2352Mb/sec (2352343200bps) errors: 0
> Device: eth4@2
> Result: OK: 7873817(c7870929+d2888) usec, 38718848 (60byte,0frags)
>    4917417pps 2360Mb/sec (2360360160bps) errors: 0
> Device: eth4@3
> Result: OK: 7873818(c7872193+d1625) usec, 38796346 (60byte,0frags)
>    4927259pps 2365Mb/sec (2365084320bps) errors: 0
>
> perf report --no-children:
>   29.48%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_process_rx_cq
>   18.17%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_alloc_frags
>    8.19%  ksoftirqd/6  [mlx4_en]         [k] mlx4_en_free_frag
>    5.35%  ksoftirqd/6  [kernel.vmlinux]  [k] get_page_from_freelist
>    2.92%  ksoftirqd/6  [kernel.vmlinux]  [k] free_pages_prepare
>    2.90%  ksoftirqd/6  [mlx4_en]         [k] mlx4_call_bpf
>    2.72%  ksoftirqd/6  [fjes]            [k] 0x000000000000af66
>    2.37%  ksoftirqd/6  [kernel.vmlinux]  [k] swiotlb_sync_single_for_cpu
>    1.92%  ksoftirqd/6  [kernel.vmlinux]  [k] percpu_array_map_lookup_elem
>    1.83%  ksoftirqd/6  [kernel.vmlinux]  [k] free_one_page
>    1.70%  ksoftirqd/6  [kernel.vmlinux]  [k] swiotlb_sync_single
>    1.69%  ksoftirqd/6  [kernel.vmlinux]  [k] bpf_map_lookup_elem
>    1.33%  swapper      [kernel.vmlinux]  [k] intel_idle
>    1.32%  ksoftirqd/6  [fjes]            [k] 0x000000000000af90
>    1.21%  ksoftirqd/6  [kernel.vmlinux]  [k] sk_load_byte_positive_offset
>    1.07%  ksoftirqd/6  [kernel.vmlinux]  [k] __alloc_pages_nodemask
>    0.89%  ksoftirqd/6  [kernel.vmlinux]  [k] __rmqueue
>    0.84%  ksoftirqd/6  [mlx4_en]         [k] mlx4_alloc_pages.isra.23
>    0.79%  ksoftirqd/6  [kernel.vmlinux]  [k] net_rx_action
>
> machine specs:
>   receiver - Intel E5-1630 v3 @ 3.70GHz
>   sender - Intel E5645 @ 2.40GHz
>   Mellanox ConnectX-3 @40G
>


OK, sorry, I should have read this far before sending my earlier email.
So when you run concurrently you see about 5Mpps per core, but if you
shoot all traffic at a single core you see 20Mpps?

Devil's advocate question:
If the bottleneck is the driver - is there an advantage in adding the
bpf code at all in the driver?
I am more curious than before to see the comparison of the same bpf code
running at the tc level vs. in the driver.

cheers,
jamal

^ permalink raw reply

* Re: [RFC PATCH v2 0/5] Add driver bpf hook for early packet drop
From: Jamal Hadi Salim @ 2016-04-09 14:37 UTC (permalink / raw)
  To: Brenden Blanco, davem
  Cc: netdev, tom, alexei.starovoitov, ogerlitz, daniel, brouer,
	eric.dumazet, ecree, john.fastabend, tgraf, johannes,
	eranlinuxmellanox, lorenzo
In-Reply-To: <1460090874-10497-1-git-send-email-bblanco@plumgrid.com>

On 16-04-08 12:47 AM, Brenden Blanco wrote:
> This patch set introduces new infrastructure for programmatically
> processing packets in the earliest stages of rx, as part of an effort
> others are calling Express Data Path (XDP) [1]. Start this effort by
> introducing a new bpf program type for early packet filtering, before even
> an skb has been allocated.
>
> With this, hope to enable line rate filtering, with this initial
> implementation providing drop/allow action only.
>
> Patch 1 introduces the new prog type and helpers for validating the bpf
> program. A new userspace struct is defined containing only len as a field,
> with others to follow in the future.
> In patch 2, create a new ndo to pass the fd to support drivers.
> In patch 3, expose a new rtnl option to userspace.
> In patch 4, enable support in mlx4 driver. No skb allocation is required,
> instead a static percpu skb is kept in the driver and minimally initialized
> for each driver frag.
> In patch 5, create a sample drop and count program. With single core,
> achieved ~20 Mpps drop rate on a 40G mlx4. This includes packet data
> access, bpf array lookup, and increment.

Hrm. This doesn't sound very high (less than 50%?).
Is the driver the main overhead?
I'd be curious, for comparison, if you just dropped everything
without bpf and alternatively with tc + bpf of the same program
on the one cpu.
Numbers we had for the NUC with tc on single core were a bit higher
than 20Mpps but there was no driver overhead, so I expected
to see much higher numbers if you did it at the driver...
Note back in the day Alexey(not Alexei;->) had a built-in driver
level forwarder;
however the advantage there was derived out of packets being DMAed
from ingress to egress port after some simple lookup.

cheers,
jamal
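
For context on the "less than 50%" estimate, the line-rate figure can be
worked out as below. This is a sketch assuming minimum-size 64-byte
Ethernet frames (the 60 bytes reported by pktgen plus a 4-byte FCS) and
the standard 20 bytes of preamble, start delimiter and inter-frame gap
per frame; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame on-wire overhead not counted in the frame size:
 * 7-byte preamble + 1-byte start delimiter + 12-byte inter-frame gap.
 */
#define ETH_WIRE_OVERHEAD_BYTES	20u

/* Maximum packets per second for a given link speed and frame size
 * (frame size includes the 4-byte FCS, e.g. 64 for minimum-size frames).
 */
static uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
	assert(frame_bytes > 0);
	return link_bps / ((uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8);
}

/* Achieved rate as a percentage of line rate, rounded down. */
static unsigned int percent_of_line_rate(uint64_t achieved_pps,
					 uint64_t link_bps,
					 uint32_t frame_bytes)
{
	return (unsigned int)(achieved_pps * 100 /
			      line_rate_pps(link_bps, frame_bytes));
}
```

Under these assumptions a 40G link carries about 59.5 Mpps of
minimum-size frames, so 20 Mpps is roughly a third of line rate.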

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] rtnetlink: add new RTM_GETSTATS message to dump link stats
From: Jamal Hadi Salim @ 2016-04-09 14:30 UTC (permalink / raw)
  To: Roopa Prabhu, netdev; +Cc: davem
In-Reply-To: <1460183892-57286-2-git-send-email-roopa@cumulusnetworks.com>

Thanks for doing the work, Roopa, and I apologize for the late comments
below:

On 16-04-09 02:38 AM, Roopa Prabhu wrote:
> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>

> This patch also allows for af family stats (an example af stats for IPV6
> is available with the second patch in the series).
>
> Like any other rtnetlink message, RTM_GETSTATS can be used to get stats of
> a single interface or all interfaces with NLM_F_DUMP.
>
> Future possible new types of stat attributes:
> - IFLA_MPLS_STATS  (nested. for mpls/mdev stats)
> - IFLA_EXTENDED_STATS (nested. extended software netdev stats like bridge,
>    vlan, vxlan etc)
> - IFLA_EXTENDED_HW_STATS (nested. extended hardware stats which are
>    available via ethtool today)
>

I get the extended_hw_stats (which are very common in a lot of ASICs) if
you mean stats on packet sizes. But would the other extended stats not
just be specific to each netdev kind? We have the concept of XSTATS,
which may be a fit.

> This patch also declares a filter mask for all stat attributes.
> User has to provide a mask of stats attributes to query. This will be
> specified in a new hdr 'struct if_stats_msg' for stats messages.
>
> Without any attributes in the filter_mask, no stats will be returned.
>

Should such a command then not be rejected with an error code?

> +/* STATS section */
> +
> +struct if_stats_msg {
> +	__u8  family;
> +	__u32 ifindex;
> +	__u32 filter_mask;
> +};

Needs to be 32 bit aligned.
Do you need 32 bits for the filter mask?
Perhaps a 16bit mask and an 8bit pad for future use.

struct if_stats_msg {
	__u32 ifindex;
	__u16 filter_mask;
	__u8  family;
	__u8  pad; /* future use */
};

Or you could reverse those (from smallest to largest).
BTW, any plans to do the cool feature where I inject a timeout period
and just get STATS events ;-> The filter struct would have to be more
sophisticated: the user would need to pass a list of ifindices and a
filter_mask as well as a timeout.
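
To illustrate the alignment point, here is a minimal sketch comparing the two layouts. It uses <stdint.h> fixed-width types as stand-ins for __u8/__u16/__u32 so it builds outside the kernel tree; the struct and function names are hypothetical, and the sizes noted in the comments are what common ABIs (e.g. x86-64) produce, not a guarantee of the C standard.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as posted in the patch: a 1-byte member followed by 4-byte
 * members, so the compiler typically inserts 3 hidden pad bytes. */
struct if_stats_msg_posted {
	uint8_t  family;
	uint32_t ifindex;
	uint32_t filter_mask;
};

/* Layout suggested in the reply: members ordered largest to smallest
 * with an explicit pad byte, leaving no implicit holes. */
struct if_stats_msg_suggested {
	uint32_t ifindex;
	uint16_t filter_mask;
	uint8_t  family;
	uint8_t  pad;	/* reserved for future use */
};

/* Bytes in each struct that are not covered by a named member. */
size_t hole_bytes_posted(void)
{
	return sizeof(struct if_stats_msg_posted) - (1 + 4 + 4);
}

size_t hole_bytes_suggested(void)
{
	return sizeof(struct if_stats_msg_suggested) - (4 + 2 + 1 + 1);
}
```

On such ABIs the posted layout takes 12 bytes with ifindex at offset 4, while the suggested one takes 8 bytes, and every byte that crosses the netlink boundary is a named field.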

cheers,
jamal

^ permalink raw reply

* [PATCH net-next 1/6] net: Make vxlan/geneve default udp ports public
From: Manish Chopra @ 2016-04-09 13:17 UTC (permalink / raw)
  To: davem; +Cc: netdev, Ariel.Elior, Yuval.Mintz
In-Reply-To: <1460207825-3622-1-git-send-email-manish.chopra@qlogic.com>

This patch defines default UDP ports for vxlan and geneve
in their respective header files to be accessed by the driver.

The rationale behind this change is that with some OVS configurations
the UDP ports are not notified to the driver via
.ndo_[add|del]_vxlan_port. For the driver to work with these specific
ports in that environment, they need to be configured on the adapter
by default for the required hardware offload support.

Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
---
 drivers/net/geneve.c | 4 +---
 drivers/net/vxlan.c  | 2 +-
 include/net/geneve.h | 1 +
 include/net/vxlan.h  | 2 ++
 4 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index a9fbf17..4f8a1bb 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -23,8 +23,6 @@
 
 #define GENEVE_NETDEV_VER	"0.6"
 
-#define GENEVE_UDP_PORT		6081
-
 #define GENEVE_N_VID		(1u << 24)
 #define GENEVE_VID_MASK		(GENEVE_N_VID - 1)
 
@@ -1361,7 +1359,7 @@ static int geneve_configure(struct net *net, struct net_device *dev,
 static int geneve_newlink(struct net *net, struct net_device *dev,
 			  struct nlattr *tb[], struct nlattr *data[])
 {
-	__be16 dst_port = htons(GENEVE_UDP_PORT);
+	__be16 dst_port = htons(GENEVE_DEF_UDP_PORT);
 	__u8 ttl = 0, tos = 0;
 	bool metadata = false;
 	union geneve_addr remote = geneve_remote_unspec;
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 9f36340..1d7af21 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -62,7 +62,7 @@
  * The IANA assigned port is 4789, but the Linux default is 8472
  * for compatibility with early adopters.
  */
-static unsigned short vxlan_port __read_mostly = 8472;
+static unsigned short vxlan_port __read_mostly = VXLAN_DEF_UDP_PORT;
 module_param_named(udp_port, vxlan_port, ushort, 0444);
 MODULE_PARM_DESC(udp_port, "Destination UDP port");
 
diff --git a/include/net/geneve.h b/include/net/geneve.h
index e6c23dc..3c3ee4a 100644
--- a/include/net/geneve.h
+++ b/include/net/geneve.h
@@ -5,6 +5,7 @@
 #include <net/udp_tunnel.h>
 #endif
 
+#define GENEVE_DEF_UDP_PORT 6081
 
 /* Geneve Header:
  *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 2f168f0..5d1b27f 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -9,6 +9,8 @@
 #include <linux/udp.h>
 #include <net/dst_metadata.h>
 
+#define VXLAN_DEF_UDP_PORT 8472
+
 /* VXLAN protocol (RFC 7348) header:
  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  * |R|R|R|R|I|R|R|R|               Reserved                        |
-- 
2.7.2

^ permalink raw reply related
