Netdev List
 help / color / mirror / Atom feed
* Re: Initial thoughts on TXDP
From: Hannes Frederic Sowa @ 2016-12-01 22:47 UTC (permalink / raw)
  To: Tom Herbert, Florian Westphal
  Cc: Linux Kernel Network Developers, Jesper Dangaard Brouer
In-Reply-To: <CALx6S36ywu3ruY7AFKYk=N4Ekr5zjY33ivx92EgNNT36XoXhFA@mail.gmail.com>

Side note:

On 01.12.2016 20:51, Tom Herbert wrote:
>> > E.g. "mini-skb": Even if we assume that this provides a speedup
>> > (where does that come from? should make no difference if a 32 or
>> >  320 byte buffer gets allocated).
>> >
> It's the zero'ing of three cache lines. I believe we talked about that
> as netdev.

Jesper and me played with that again very recently:

https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590

In micro-benchmarks we saw a pretty good speed up not using the rep
stosb generated by gcc builtin but plain movq's. Probably the cost model
for __builtin_memset in gcc is wrong?

When Jesper is free we wanted to benchmark this and maybe come up with a
arch specific way of cleaning if it turns out to really improve throughput.

SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
all the benefits.

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH net-next v3] ipv6 addrconf: Implemented enhanced DAD (RFC7527)
From: Hannes Frederic Sowa @ 2016-12-01 22:28 UTC (permalink / raw)
  To: Erik Nordmark, davem; +Cc: netdev, Bob Gilligan
In-Reply-To: <1480549159-8142-1-git-send-email-nordmark@arista.com>

On 01.12.2016 00:39, Erik Nordmark wrote:
> Implemented RFC7527 Enhanced DAD.
> IPv6 duplicate address detection can fail if there is some temporary
> loopback of Ethernet frames. RFC7527 solves this by including a random
> nonce in the NS messages used for DAD, and if an NS is received with the
> same nonce it is assumed to be a looped back DAD probe and is ignored.
> RFC7527 is enabled by default. Can be disabled by setting both of
> conf/{all,interface}/enhanced_dad to zero.
> 
> Signed-off-by: Erik Nordmark <nordmark@arista.com>
> Signed-off-by: Bob Gilligan <gilligan@arista.com>
> ---

Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Thanks!
> @@ -794,6 +808,17 @@ static void ndisc_recv_ns(struct sk_buff *skb)
>  have_ifp:
>  		if (ifp->flags & (IFA_F_TENTATIVE|IFA_F_OPTIMISTIC)) {
>  			if (dad) {
> +				if (nonce != 0 && ifp->dad_nonce == nonce) {
> +					u8 *np = (u8 *)&nonce;
> +					/* Matching nonce if looped back */
> +					ND_PRINTK(2, notice,
> +						  "%s: IPv6 DAD loopback for address %pI6c nonce %02x:%02x:%02x:%02x:%02x:%02x ignored\n",
> +						  ifp->idev->dev->name,
> +						  &ifp->addr,
> +						  np[0], np[1], np[2], np[3],
> +						  np[4], np[5]);
> +					goto out;
> +				}
>  				/*
>  				 * We are colliding with another node
>  				 * who is doing DAD
> 

I think it could be a "%pM" because it looks like a MAC address, but
better leave it like that. :)

Bye,
Hannes

^ permalink raw reply

* Re: [PATCH] mm: page_alloc: High-order per-cpu page allocator v3
From: Paolo Abeni @ 2016-12-01 22:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Mel Gorman, Andrew Morton, Christoph Lameter, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Linux-MM, Linux-Kernel,
	Rick Jones, netdev@vger.kernel.org, Hannes Frederic Sowa
In-Reply-To: <20161201183402.2fbb8c5b@redhat.com>

On Thu, 2016-12-01 at 18:34 +0100, Jesper Dangaard Brouer wrote:
> (Cc. netdev, we might have an issue with Paolo's UDP accounting and
> small socket queues)
> 
> On Wed, 30 Nov 2016 16:35:20 +0000
> Mel Gorman <mgorman@techsingularity.net> wrote:
> 
> > > I don't quite get why you are setting the socket recv size
> > > (with -- -s and -S) to such a small number, size + 256.
> > >   
> > 
> > Maybe I missed something at the time I wrote that but why would it
> > need to be larger?
> 
> Well, to me it is quite obvious that we need some queue to avoid packet
> drops.  We have two processes netperf and netserver, that are sending
> packets between each-other (UDP_STREAM mostly netperf -> netserver).
> These PIDs are getting scheduled and migrated between CPUs, and thus
> does not get executed equally fast, thus a queue is need absorb the
> fluctuations.
> 
> The network stack is even partly catching your config "mistake" and
> increase the socket queue size, so we minimum can handle one max frame
> (due skb "truesize" concept approx PAGE_SIZE + overhead).
> 
> Hopefully for localhost testing a small queue should hopefully not
> result in packet drops.  Testing... ups, this does result in packet
> drops.
> 
> Test command extracted from mmtests, UDP_STREAM size 1024:
> 
>  netperf-2.4.5-installed/bin/netperf -t UDP_STREAM  -l 60  -H 127.0.0.1 \
>    -- -s 1280 -S 1280 -m 1024 -M 1024 -P 15895
> 
>  UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0)
>   port 15895 AF_INET to 127.0.0.1 (127.0.0.1) port 15895 AF_INET
>  Socket  Message  Elapsed      Messages                
>  Size    Size     Time         Okay Errors   Throughput
>  bytes   bytes    secs            #      #   10^6bits/sec
> 
>    4608    1024   60.00     50024301      0    6829.98
>    2560           60.00     46133211           6298.72
> 
>  Dropped packets: 50024301-46133211=3891090
> 
> To get a better drop indication, during this I run a command, to get
> system-wide network counters from the last second, so below numbers are
> per second.
> 
>  $ nstat > /dev/null && sleep 1  && nstat
>  #kernel
>  IpInReceives                    885162             0.0
>  IpInDelivers                    885161             0.0
>  IpOutRequests                   885162             0.0
>  UdpInDatagrams                  776105             0.0
>  UdpInErrors                     109056             0.0
>  UdpOutDatagrams                 885160             0.0
>  UdpRcvbufErrors                 109056             0.0
>  IpExtInOctets                   931190476          0.0
>  IpExtOutOctets                  931189564          0.0
>  IpExtInNoECTPkts                885162             0.0
> 
> So, 885Kpps but only 776Kpps delivered and 109Kpps drops. See
> UdpInErrors and UdpRcvbufErrors is equal (109056/sec). This drop
> happens kernel side in __udp_queue_rcv_skb[1], because receiving
> process didn't empty it's queue fast enough see [2].
> 
> Although upstream changes are coming in this area, [2] is replaced with
> __udp_enqueue_schedule_skb, which I actually tested with... hmm
> 
> Retesting with kernel 4.7.0-baseline+ ... show something else. 
> To Paolo, you might want to look into this.  And it could also explain why
> I've not see the mentioned speedup by mm-change, as I've been testing
> this patch on top of net-next (at 93ba2222550) with Paolo's UDP changes.

Thank you for reporting this.

It seems that the commit 123b4a633580 ("udp: use it's own memory
accounting schema") is too strict while checking the rcvbuf. 

For very small value of rcvbuf, it allows a single skb to be enqueued,
while previously we allowed 2 of them to enter the queue, even if the
first one truesize exceeded rcvbuf, as in your test-case.

Can you please try the following patch ?

Thank you,

Paolo
---
 net/ipv4/udp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1d0bf8..2f5dc92 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1200,19 +1200,21 @@ int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
 	struct sk_buff_head *list = &sk->sk_receive_queue;
 	int rmem, delta, amt, err = -ENOMEM;
 	int size = skb->truesize;
+	int limit;
 
 	/* try to avoid the costly atomic add/sub pair when the receive
 	 * queue is full; always allow at least a packet
 	 */
 	rmem = atomic_read(&sk->sk_rmem_alloc);
-	if (rmem && (rmem + size > sk->sk_rcvbuf))
+	limit = size + sk->sk_rcvbuf;
+	if (rmem > limit)
 		goto drop;
 
 	/* we drop only if the receive buf is full and the receive
 	 * queue contains some other skb
 	 */
 	rmem = atomic_add_return(size, &sk->sk_rmem_alloc);
-	if ((rmem > sk->sk_rcvbuf) && (rmem > size))
+	if (rmem > limit)
 		goto uncharge_drop;
 
 	spin_lock(&list->lock);





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH 1/1] NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
From: Bjørn Mork @ 2016-12-01 22:13 UTC (permalink / raw)
  To: Daniele Palmas; +Cc: netdev
In-Reply-To: <1480607525-23044-2-git-send-email-dnlplm@gmail.com>

Daniele Palmas <dnlplm@gmail.com> writes:

> This patch adds support for PID 0x1040 of Telit LE922A.

LGTM

Acked-by: Bjørn Mork <bjorn@mork.no>

^ permalink raw reply

* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 22:12 UTC (permalink / raw)
  To: Rick Jones; +Cc: Sowmini Varadhan, Linux Kernel Network Developers
In-Reply-To: <1e88fd64-0045-beb5-101a-a55b8f54bd08@hpe.com>

On Thu, Dec 1, 2016 at 1:47 PM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 12/01/2016 12:18 PM, Tom Herbert wrote:
>>
>> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jones2@hpe.com> wrote:
>>>
>>> Just how much per-packet path-length are you thinking will go away under
>>> the
>>> likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does
>>> some
>>> non-trivial things to effective overhead (service demand) and so
>>> throughput:
>>>
>> For plain in order TCP packets I believe we should be able process
>> each packet at nearly same speed as GRO. Most of the protocol
>> processing we do between GRO and the stack are the same, the
>> differences are that we need to do a connection lookup in the stack
>> path (note we now do this is UDP GRO and that hasn't show up as a
>> major hit). We also need to consider enqueue/dequeue on the socket
>> which is a major reason to try for lockless sockets in this instance.
>
>
> So waving hands a bit, and taking the service demand for the GRO-on receive
> test in my previous message (860 ns/KB), that would be ~ (1448/1024)*860 or
> ~1.216 usec of CPU time per TCP segment, including ACK generation which
> unless an explicit ACK-avoidance heuristic a la HP-UX 11/Solaris 2 is put in
> place would be for every-other segment. Etc etc.
>
>> Sure, but trying running something emulates a more realistic workload
>> than a TCP stream, like RR test with relative small payload and many
>> connections.
>
>
> That is a good point, which of course is why the RR tests are there in
> netperf :) Don't get me wrong, I *like* seeing path-length reductions. What
> would you posit is a relatively small payload?  The promotion of IR10
> suggests that perhaps 14KB or so is a sufficiently common so I'll grasp at
> that as the length of a piece of string:
>
We have consider both request size and response side in RPC.
Presumably, something like a memcache server is most serving data as
opposed to reading it, we are looking to receiving much smaller
packets than being sent. Requests are going to be quite small say 100
bytes and unless we are doing significant amount of pipelining on
connections GRO would rarely kick-in. Response size will have a lot of
variability, anything from a few kilobytes up to a megabyte. I'm sorry
I can't be more specific this is an artifact of datacenters that have
100s of different applications and communication patterns. Maybe 100b
request size, 8K, 16K, 64K response sizes might be good for test.

> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,14K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     14336  10.00   8118.31  1.57   -1.00  46.410  -1.000
> 16384  87380
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,14K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     14336  10.00   5837.35  2.20   -1.00  90.628  -1.000
> 16384  87380
>
> So, losing GRO doubled the service demand.  I suppose I could see cutting
> path-length in half based on the things you listed which would be bypassed?
>
> I'm sure mileage will vary with different NICs and CPUs.  The ones used here
> happened to be to hand.
>
This is also biased because you're using a single connection, but is
consistent with data we've seen in the past. To be clear I'm not
saying GRO is bad, the fact that GRO has such a visible impact in your
test means that the GRO path is significantly more efficient. Closing
that gap seen in your numbers would be a benefit, that means we have
improved per packet processing.

Tom

> happy benchmarking,
>
> rick
>
> Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly
> goes to more than doubling, and halving to 7K narrows the delta:
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,28K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     28672  10.00   6732.32  1.79   -1.00  63.819  -1.000
> 16384  87380
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,28K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     28672  10.00   3780.47  2.32   -1.00  147.280  -1.000
> 16384  87380
>
>
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,7K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     7168   10.00   10535.01  1.52   -1.00  34.664  -1.000
> 16384  87380
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t
> TCP_RR -- -P 12867 -r 128,7K
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET
> to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr
>
> 16384  87380  128     7168   10.00   8225.17  1.80   -1.00  52.661  -1.000
> 16384  87380
>
>

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01 22:10 UTC (permalink / raw)
  To: David Miller; +Cc: brouer, saeedm, rick.jones2, netdev, saeedm, tariqt
In-Reply-To: <20161201.152029.2045474106824619667.davem@davemloft.net>

On Thu, 2016-12-01 at 15:20 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 01 Dec 2016 09:04:17 -0800
> 
> > On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
> > 
> >> When qdisc layer or trafgen/af_packet see this indication it knows it
> >> should/must flush the queue when it don't have more work left.  Perhaps
> >> through net_tx_action(), by registering itself and e.g. if qdisc_run()
> >> is called and queue is empty then check if queue needs a flush. I would
> >> also allow driver to flush and clear this bit.
> > 
> > net_tx_action() is not normally called, unless BQL limit is hit and/or
> > some qdiscs with throttling (HTB, TBF, FQ, ...)
> 
> The one thing I wonder about is whether we should "ramp up" into a mode
> where the NAPI poll does the doorbells instead of going directly there.
> 
> Maybe I misunderstand your algorithm, but it looks to me like if there
> are any active packets in the TX queue at enqueue time you will defer
> the doorbell to the interrupt handler.
> 
> Let's say we put 1 packet in, and hit the doorbell.
> 
> Then another packet comes in and we defer the doorbell to the IRQ.
> 
> At this point there are a couple things I'm unclear about.


> 
> For example, if we didn't hit the doorbell, will the chip still take a
> peek at the second descriptor?  Depending upon how the doorbell works
> it might, or it might not.

It might depend on the hardware. I can easily check on mlx4, by
increasing tx-usecs and tx-frames, and sending 2 packets back to back.

> 
> Either way, wouldn't there be a possible condition where the chip
> wouldn't see the second enqueued packet and we'd thus have the wire
> idle until the interrupt + NAPI runs and hits the doorbell?
> 
> This is why I think we should "ramp up" the doorbell deferral, in
> order to avoid this potential wire idle time situation.


> 
> Maybe the situation I'm worried about is not possible, so please
> explain it to me :-)

This is absolutely the problem. We might need to enable this mode only
above a given load. We could have an EWMA of the number of packets
that TX completion runs can dequeue. And enable auto doorbell only if we
have that many packets in the TX ring (instead of the  "1 packet
threshold" of the WIP)

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01 22:04 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesper Dangaard Brouer, Rick Jones, Netdev, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <CAKgT0Uf_EeWYqGinyV6U1r-Jqd07R9gN-7GeA82emk4EoN=1mQ@mail.gmail.com>

On Thu, 2016-12-01 at 13:32 -0800, Alexander Duyck wrote:

> A few years back when I did something like this on ixgbe I was told by
> you that the issue was that doing something like this would add too
> much latency.  I was just wondering what the latency impact is on a
> change like this and if this now isn't that much of an issue?

I believe it is clear that we can not use this trick without admin
choice.

In my experience, mlx4 can deliver way more interrupts than ixgbe.
(No idea why)

This "auto doorbell" is tied to proper "ethtool -c tx-usecs|tx-frames|
tx-frames-irq", or simply a choice for hosts dedicated to content
delivery (like video servers) with 40GBit+ cards.

Under stress, softirq can be handled by ksoftirqd, and might be delayed
by ~10 ms or even more.

This WIP was minimal, but we certainly need to add belts and suspenders.

1) <maybe> timestamps
  use a ktime_get_ns() to remember a timestamp for the last doorbell,
  and force a doorbell if it gets too old, checked in ndo_start_xmit()
  every time we would like to not send the doorbell.

2) <maybe> max numbers of packets not yet doorbelled.

But we can not rely on another high resolution timer, since they also
require softirq being processed timely.

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 21:57 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130182243.cpoyoqvxuv2bnn4y@splinter.mtl.com>

On 30.11.2016 19:22, Ido Schimmel wrote:
> On Wed, Nov 30, 2016 at 05:49:56PM +0100, Hannes Frederic Sowa wrote:
>> On 30.11.2016 17:32, Ido Schimmel wrote:
>>> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
>>>> On 30.11.2016 11:09, Jiri Pirko wrote:
>>>>> From: Ido Schimmel <idosch@mellanox.com>
>>>>>
>>>>> Make sure the device has a complete view of the FIB tables by invoking
>>>>> their dump during module init.
>>>>>
>>>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>>>> ---
>>>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
>>>>>  1 file changed, 23 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> index 14bed1d..d176047 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>>>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>>>>>  	return NOTIFY_DONE;
>>>>>  }
>>>>>  
>>>>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
>>>>> +{
>>>>> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
>>>>> +
>>>>> +	/* Flush pending FIB notifications and then flush the device's
>>>>> +	 * table before requesting another dump. Do that with RTNL held,
>>>>> +	 * as FIB notification block is already registered.
>>>>> +	 */
>>>>> +	mlxsw_core_flush_owq();
>>>>> +	rtnl_lock();
>>>>> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
>>>>> +	rtnl_unlock();
>>>>> +}
>>>>> +
>>>>>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>>>  {
>>>>> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
>>>>>  	int err;
>>>>>  
>>>>>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
>>>>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>>>  
>>>>>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
>>>>>  	register_fib_notifier(&mlxsw_sp->fib_nb);
>>>>
>>>> Sorry to pick in here again:
>>>>
>>>> There is a race here. You need to protect the registration of the fib
>>>> notifier as well by the sequence counter. Updates here are not ordered
>>>> in relation to this code below.
>>>
>>> You mean updates that can be received after you registered the notifier
>>> and until the dump started? I'm aware of that and that's OK. This
>>> listener should be able to handle duplicates.
>>
>> I am not concerned about duplicates, but about ordering deletes and
>> getting an add from the RCU code you will add the node to hw while it is
>> deleted in the software path. You probably will ignore the delete
>> because nothing is installed in hw and later add the node which was
>> actually deleted but just reordered which happend on another CPU, no?
> 
> Are you referring to reordering in the workqueue? We already covered
> this using an ordered workqueue, which has one context of execution
> system-wide.

Ups, sorry, I missed that mail. Probably read it on the mobile phone and
it became invisible for me later on. Busy day... ;)

The reordering in the workqueue seems fine to me and also still necessary.

Basically, if you delete a node right now the kernel might simply do a
RCU_INIT_POINTER(ptr_location, NULL), which has absolutely no barriers
or synchronization with the reader side. Thus you might get a callback
from the notifier for a delete event on the one CPU and you end up
queueing this fib entry after the delete queue, because the RCU walk
isn't protected by any means.

Looking closer at this series again, I overlooked the fact that you
fetch fib_seq using a rtnl_lock and rtnl_unlock pair, which first of all
orders fetching of fib_seq and thus the RCU dumping after any concurrent
executing fib table update, also the mutex_lock and unlock provide
proper acquire and release fences, so the CPU indeed sees the effect of
a RCU_INIT_POINTER update done on another CPU, because they pair with
the rtnl_unlock which might happen on the other CPU.

My question is if this is a bit of luck and if we should make this
explicit by putting the registration itself under the protection of the
sequence counter. I favor the additional protection, e.g. if we some day
actually we optimize the fib_seq code? Otherwise we might probably
document this fact. :)

>>> I've a follow up patchset that introduces a new event in switchdev
>>> notification chain called SWITCHDEV_SYNC, which is sent when port
>>> netdevs are enslaved / released  from a master device (points in time
>>> where kernel<->device can get out of sync). It will invoke
>>> re-propagation of configuration from different parts of the stack
>>> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
>>> in duplicates.
>>
>> Okay, understood. I wonder how we can protect against accidentally abort
>> calls actually. E.g. if I start to inject routes into my routing domain
>> how can I make sure the box doesn't die after I try to insert enough
>> routes. Do we need to touch quagga etc?
> 
> The whole point of moving abort mechanism to the driver is that the
> system won't die, but instead routing will be done in the kernel. If you
> respect hardware limitations, then there's no reason for abort mechanism
> to kick in.

Quick follow-up question: How can I quickly find out the hw limitations
via the kernel api?

Thanks,
Hannes

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Tom Herbert @ 2016-12-01 21:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Thomas Graf, Florian Westphal, Linux Kernel Network Developers
In-Reply-To: <9b4264f8-26b9-a611-56f0-0840cecf9c44@stressinduktion.org>

On Thu, Dec 1, 2016 at 1:27 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 01.12.2016 22:12, Tom Herbert wrote:
>> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>>> Hello,
>>>
>>> this is a good conversation and I simply want to bring my worries
>>> across. I don't have good solutions for the problems XDP tries to solve
>>> but I fear we could get caught up in maintenance problems in the long
>>> term given the ideas floating around on how to evolve XDP currently.
>>>
>>> On 01.12.2016 17:28, Thomas Graf wrote:
>>>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>>>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>>>>> XDP manipulates packets at free will and thus all security guarantees
>>>>> are off as well as in any user space solution.
>>>>>
>>>>> Secondly user space provides policy, acl, more controlled memory
>>>>> protection, restartability and better debugability. If I had multi
>>>>> tenant workloads I would definitely put more complex "business/acl"
>>>>> logic into user space, so I can make use of LSM and other features to
>>>>> especially prevent a network facing service to attack the tenants. If
>>>>> stuff gets put into the kernel you run user controlled code in the
>>>>> kernel exposing a much bigger attack vector.
>>>>>
>>>>> What use case do you see in XDP specifically e.g. for container networking?
>>>>
>>>> DDOS mitigation to protect distributed applications in large clusters.
>>>> Relying on CDN works to protect API gateways and frontends (as long as
>>>> they don't throw you out of their network) but offers no protection
>>>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>>>> level and allowing the mitigation capability to scale up with the number
>>>> of servers is natural and cheap.
>>>
>>> So far we e.g. always considered L2 attacks a problem of the network
>>> admin to correctly protect the environment. Are you talking about
>>> protecting the L3 data plane? Are there custom proprietary protocols in
>>> place which need custom protocol parsers that need involvement of the
>>> kernel before it could verify the packet?
>>>
>>> In the past we tried to protect the L3 data plane as good as we can in
>>> Linux to allow the plain old server admin to set an IP address on an
>>> interface and install whatever software in user space. We try not only
>>> to protect it but also try to achieve fairness by adding a lot of
>>> counters everywhere. Are protections missing right now or are we talking
>>> about better performance?
>>>
>> The technical plenary at last IETF on Seoul a couple of weeks ago was
>> exclusively focussed on DDOS in light of the recent attack against
>> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
>> presentation by Nick Sullivan
>> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
>> alluded to some implementation of DDOS mitigation. In particular, on
>> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
>> numbers he gave we're based in iptables+BPF and that was a whole
>> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
>> and that's also when I introduced XDP to whole IETF :-) ). If that's
>> the best we can do the Internet is in a world hurt. DDOS mitigation
>> alone is probably a sufficient motivation to look at XDP. We need
>> something that drops bad packets as quickly as possible when under
>> attack, we need this to be integrated into the stack, we need it to be
>> programmable to deal with the increasing savvy of attackers, and we
>> don't want to be forced to be dependent on HW solutions. This is why
>> we created XDP!
>
> I totally understand that. But in my reply to David in this thread I
> mentioned DNS apex processing as being problematic which is actually
> being referred in your linked slide deck on page 9 ("What do floods look
> like") and the problematic of parsing DNS packets in XDP due to string
> processing and looping inside eBPF.
>
I agree that eBPF is not going to be sufficient from everything we'll
want to do. Undoubtably, we'll continue see new addition of more
helpers to assist in processing, but at some point we will want a to
load a kernel module that handles more complex processing and insert
it at the XDP callout. Nothing in the design of XDP precludes doing
that and I have already posted the patches to generalize the XDP
callout for that. Taking either of these routes has tradeoffs, but
regardless of whether this is BPF or module code, the principles of
XDP and its value to help solve some class of problems remains.

 Tom

> Not to mention the fact that you might have to deal with fragments in
> the Internet. Some DOS mitigations were already abused to generate
> blackholes for other users. Filtering such stuff is quite complicated.
>
> I argued also under the aspect of what Thomas said, that the outside
> world of the cluster is already protected by a CDN.
>
> Bye,
> Hannes
>

^ permalink raw reply

* [PATCH] net: ethernet: ti: davinci_cpdma: sanitize inter-module API
From: Arnd Bergmann @ 2016-12-01 21:51 UTC (permalink / raw)
  To: Mugunthan V N
  Cc: Arnd Bergmann, Grygorii Strashko, David S. Miller,
	Ivan Khoronzhuk, Uwe Kleine-König, linux-omap, netdev,
	linux-kernel

The davinci_cpdma module is a helper library that is used by the
actual device drivers and does nothing by itself, so all its API
functions need to be exported.

Four new functions were added recently without an export, so now
we get a build error:

ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!

This exports those symbols. After taking a closer look, I found two
more global functions in this file that are not exported and not used
anywhere, so they can obviously be removed. There is also one function
that is only used internally in this module, and should be 'static'.

Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Fixes: 0fc6432cc78d ("net: ethernet: ti: davinci_cpdma: add weight function for channels")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/ethernet/ti/davinci_cpdma.c | 63 +++++++++++----------------------
 drivers/net/ethernet/ti/davinci_cpdma.h |  4 ---
 2 files changed, 21 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c776e4575d2d..b9d40f0cdf6c 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -628,6 +628,23 @@ int cpdma_ctlr_destroy(struct cpdma_ctlr *ctlr)
 }
 EXPORT_SYMBOL_GPL(cpdma_ctlr_destroy);
 
+static int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&chan->lock, flags);
+	if (chan->state != CPDMA_STATE_ACTIVE) {
+		spin_unlock_irqrestore(&chan->lock, flags);
+		return -EINVAL;
+	}
+
+	dma_reg_write(chan->ctlr, enable ? chan->int_set : chan->int_clear,
+		      chan->mask);
+	spin_unlock_irqrestore(&chan->lock, flags);
+
+	return 0;
+}
+
 int cpdma_ctlr_int_ctrl(struct cpdma_ctlr *ctlr, bool enable)
 {
 	unsigned long flags;
@@ -796,6 +813,7 @@ int cpdma_chan_set_weight(struct cpdma_chan *ch, int weight)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_weight);
 
 /* cpdma_chan_get_min_rate - get minimum allowed rate for channel
  * Should be called before cpdma_chan_set_rate.
@@ -810,6 +828,7 @@ u32 cpdma_chan_get_min_rate(struct cpdma_ctlr *ctlr)
 
 	return DIV_ROUND_UP(divident, divisor);
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_min_rate);
 
 /* cpdma_chan_set_rate - limits bandwidth for transmit channel.
  * The bandwidth * limited channels have to be in order beginning from lowest.
@@ -853,6 +872,7 @@ int cpdma_chan_set_rate(struct cpdma_chan *ch, u32 rate)
 	spin_unlock_irqrestore(&ctlr->lock, flags);
 	return ret;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_set_rate);
 
 u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
 {
@@ -865,6 +885,7 @@ u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
 
 	return rate;
 }
+EXPORT_SYMBOL_GPL(cpdma_chan_get_rate);
 
 struct cpdma_chan *cpdma_chan_create(struct cpdma_ctlr *ctlr, int chan_num,
 				     cpdma_handler_fn handler, int rx_type)
@@ -1270,46 +1291,4 @@ int cpdma_chan_stop(struct cpdma_chan *chan)
 }
 EXPORT_SYMBOL_GPL(cpdma_chan_stop);
 
-int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&chan->lock, flags);
-	if (chan->state != CPDMA_STATE_ACTIVE) {
-		spin_unlock_irqrestore(&chan->lock, flags);
-		return -EINVAL;
-	}
-
-	dma_reg_write(chan->ctlr, enable ? chan->int_set : chan->int_clear,
-		      chan->mask);
-	spin_unlock_irqrestore(&chan->lock, flags);
-
-	return 0;
-}
-
-int cpdma_control_get(struct cpdma_ctlr *ctlr, int control)
-{
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&ctlr->lock, flags);
-	ret = _cpdma_control_get(ctlr, control);
-	spin_unlock_irqrestore(&ctlr->lock, flags);
-
-	return ret;
-}
-
-int cpdma_control_set(struct cpdma_ctlr *ctlr, int control, int value)
-{
-	unsigned long flags;
-	int ret;
-
-	spin_lock_irqsave(&ctlr->lock, flags);
-	ret = _cpdma_control_set(ctlr, control, value);
-	spin_unlock_irqrestore(&ctlr->lock, flags);
-
-	return ret;
-}
-EXPORT_SYMBOL_GPL(cpdma_control_set);
-
 MODULE_LICENSE("GPL");
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.h b/drivers/net/ethernet/ti/davinci_cpdma.h
index 4a167db2abab..36d0a09a3d44 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.h
+++ b/drivers/net/ethernet/ti/davinci_cpdma.h
@@ -87,7 +87,6 @@ int cpdma_chan_process(struct cpdma_chan *chan, int quota);
 
 int cpdma_ctlr_int_ctrl(struct cpdma_ctlr *ctlr, bool enable);
 void cpdma_ctlr_eoi(struct cpdma_ctlr *ctlr, u32 value);
-int cpdma_chan_int_ctrl(struct cpdma_chan *chan, bool enable);
 u32 cpdma_ctrl_rxchs_state(struct cpdma_ctlr *ctlr);
 u32 cpdma_ctrl_txchs_state(struct cpdma_ctlr *ctlr);
 bool cpdma_check_free_tx_desc(struct cpdma_chan *chan);
@@ -111,7 +110,4 @@ enum cpdma_control {
 	CPDMA_RX_BUFFER_OFFSET,		/* read-write */
 };
 
-int cpdma_control_get(struct cpdma_ctlr *ctlr, int control);
-int cpdma_control_set(struct cpdma_ctlr *ctlr, int control, int value);
-
 #endif
-- 
2.9.0

^ permalink raw reply related

* Re: Initial thoughts on TXDP
From: Rick Jones @ 2016-12-01 21:47 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Sowmini Varadhan, Linux Kernel Network Developers
In-Reply-To: <CALx6S37aknqfrj66AP8hEfVT9X2OmBFK9GVa9A+0FpydPbm9kg@mail.gmail.com>

On 12/01/2016 12:18 PM, Tom Herbert wrote:
> On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jones2@hpe.com> wrote:
>> Just how much per-packet path-length are you thinking will go away under the
>> likes of TXDP?  It is admittedly "just" netperf but losing TSO/GSO does some
>> non-trivial things to effective overhead (service demand) and so throughput:
>>
> For plain in order TCP packets I believe we should be able process
> each packet at nearly same speed as GRO. Most of the protocol
> processing we do between GRO and the stack are the same, the
> differences are that we need to do a connection lookup in the stack
> path (note we now do this is UDP GRO and that hasn't show up as a
> major hit). We also need to consider enqueue/dequeue on the socket
> which is a major reason to try for lockless sockets in this instance.

So waving hands a bit, and taking the service demand for the GRO-on 
receive test in my previous message (860 ns/KB), that would be ~ 
(1448/1024)*860 or ~1.216 usec of CPU time per TCP segment, including 
ACK generation which unless an explicit ACK-avoidance heuristic a la 
HP-UX 11/Solaris 2 is put in place would be for every-other segment. Etc 
etc.

> Sure, but trying running something emulates a more realistic workload
> than a TCP stream, like RR test with relative small payload and many
> connections.

That is a good point, which of course is why the RR tests are there in 
netperf :) Don't get me wrong, I *like* seeing path-length reductions. 
What would you posit is a relatively small payload?  The promotion of 
IR10 suggests that perhaps 14KB or so is a sufficiently common so I'll 
grasp at that as the length of a piece of string:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     14336  10.00   8118.31  1.57   -1.00  46.410  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,14K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     14336  10.00   5837.35  2.20   -1.00  90.628  -1.000
16384  87380

So, losing GRO doubled the service demand.  I suppose I could see 
cutting path-length in half based on the things you listed which would 
be bypassed?

I'm sure mileage will vary with different NICs and CPUs.  The ones used 
here happened to be to hand.

happy benchmarking,

rick

Just to get a crude feel for sensitivity, doubling to 28K unsurprisingly 
goes to more than doubling, and halving to 7K narrows the delta:

stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     28672  10.00   6732.32  1.79   -1.00  63.819  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,28K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     28672  10.00   3780.47  2.32   -1.00  147.280  -1.000
16384  87380



stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     7168   10.00   10535.01  1.52   -1.00  34.664  -1.000
16384  87380
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t 
TCP_RR -- -P 12867 -r 128,7K
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 12867 
AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo : first burst 0
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % U    us/Tr   us/Tr

16384  87380  128     7168   10.00   8225.17  1.80   -1.00  52.661  -1.000
16384  87380

^ permalink raw reply

* Re: DSA vs. SWTICHDEV ?
From: Murali Karicheri @ 2016-12-01 21:38 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Joakim Tjernlund, netdev@vger.kernel.org, Roger Quadros,
	Grygorii Strashko
In-Reply-To: <20161201173100.GG21887@lunn.ch>

Hi Andrew,
On 12/01/2016 12:31 PM, Andrew Lunn wrote:
> Hi Murali 
> 
>> 2. Switch mode where it implements a simple Ethernet switch. Currently
>>    it doesn't have address learning capability, but in future it
>>    can.
> 
> If it does not have address learning capabilities, does it act like a
> plain old hub? What comes in one port goes out all others?

Thanks for the response!

Yes. It is a plain hub. it replicates frame to both ports. So need to
run a bridge layer for address learning in software.

> 
> Or can you do the learning in software on the host and program tables,
> which the hardware then uses?
> 

I think not. I see we have a non Linux implementation that does address
learning in software using a hash table and look up MAC for each packet
to see which port it needs to be sent to.

Murali

>> 3. Switch with HSR/PRP offload where it provides HSR/PRP protocol
>>    support and cut through switch.
>>
>> So a device need to function in one of the modes. A a regular Ethernet
>> driver that provides two network devices, one per port, and switchdev
>> for each physical port (in switch mode) will look ideal in this case.
> 
> Yes, this seems the right model to use.
> 
>      Andrew
> 


-- 
Murali Karicheri
Linux Kernel, Keystone

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Alexander Duyck @ 2016-12-01 21:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Rick Jones, Netdev, Saeed Mahameed,
	Tariq Toukan
In-Reply-To: <1480402716.18162.124.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, Nov 28, 2016 at 10:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>
>
>> Not sure it this has been tried before, but the doorbell avoidance could
>> be done by the driver itself, because it knows a TX completion will come
>> shortly (well... if softirqs are not delayed too much !)
>>
>> Doorbell would be forced only if :
>>
>> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> OR
>> ( too many [1] packets were put in TX ring buffer, no point deferring
>> more)
>>
>> Start the pump, but once it is started, let the doorbells being done by
>> TX completion.
>>
>> ndo_start_xmit and TX completion handler would have to maintain a shared
>> state describing if packets were ready but doorbell deferred.
>>
>>
>> Note that TX completion means "if at least one packet was drained",
>> otherwise busy polling, constantly calling napi->poll() would force a
>> doorbell too soon for devices sharing a NAPI for both RX and TX.
>>
>> But then, maybe busy poll would like to force a doorbell...
>>
>> I could try these ideas on mlx4 shortly.
>>
>>
>> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>
> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.
>
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:26         eth0      0.00 5822800.00      0.00 597064.41      0.00      0.00      1.00
> 22:43:27         eth0     24.00 5788237.00      2.09 593520.26      0.00      0.00      0.00
> 22:43:28         eth0     12.00 5817777.00      1.43 596551.47      0.00      0.00      1.00
> 22:43:29         eth0     22.00 5841516.00      1.61 598982.87      0.00      0.00      0.00
> 22:43:30         eth0      4.00 4389137.00      0.71 450058.08      0.00      0.00      1.00
> 22:43:31         eth0      4.00 5871008.00      0.72 602007.79      0.00      0.00      0.00
> 22:43:32         eth0     12.00 5891809.00      1.43 604142.60      0.00      0.00      1.00
> 22:43:33         eth0     10.00 5901904.00      1.12 605175.70      0.00      0.00      0.00
> 22:43:34         eth0      5.00 5907982.00      0.69 605798.99      0.00      0.00      1.00
> 22:43:35         eth0      2.00 5847086.00      0.12 599554.71      0.00      0.00      0.00
> Average:         eth0      9.50 5707925.60      0.99 585285.69      0.00      0.00      0.50
> lpaa23:~# echo 1 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:43:47         eth0      9.00 10226424.00      1.02 1048608.05      0.00      0.00      1.00
> 22:43:48         eth0      1.00 10316955.00      0.06 1057890.89      0.00      0.00      0.00
> 22:43:49         eth0      1.00 10310104.00      0.10 1057188.32      0.00      0.00      1.00
> 22:43:50         eth0      0.00 10249423.00      0.00 1050966.23      0.00      0.00      0.00
> 22:43:51         eth0      0.00 10210441.00      0.00 1046969.05      0.00      0.00      1.00
> 22:43:52         eth0      2.00 10198389.00      0.16 1045733.17      0.00      0.00      1.00
> 22:43:53         eth0      8.00 10079257.00      1.43 1033517.83      0.00      0.00      0.00
> 22:43:54         eth0      0.00 7693509.00      0.00 788885.16      0.00      0.00      0.00
> 22:43:55         eth0      2.00 10343076.00      0.20 1060569.32      0.00      0.00      1.00
> 22:43:56         eth0      1.00 10224571.00      0.14 1048417.93      0.00      0.00      0.00
> Average:         eth0      2.40 9985214.90      0.31 1023874.60      0.00      0.00      0.50
>
> And about 11 % improvement on an mono-flow UDP_STREAM test.
>
> skb_set_owner_w() is now the most consuming function.

A few years back when I did something like this on ixgbe I was told by
you that the issue was that doing something like this would add too
much latency.  I was just wondering what the latency impact is on a
change like this and if this now isn't that much of an issue?

>
> lpaa23:~# ./udpsnd -4 -H 10.246.7.152 -d 2 &
> [1] 13696
> lpaa23:~# echo 0 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:50:47         eth0      3.00 1355422.00      0.45 319706.04      0.00      0.00      0.00
> 22:50:48         eth0      2.00 1344270.00      0.42 317035.21      0.00      0.00      1.00
> 22:50:49         eth0      3.00 1350503.00      0.51 318478.34      0.00      0.00      0.00
> 22:50:50         eth0     29.00 1348593.00      2.86 318113.02      0.00      0.00      1.00
> 22:50:51         eth0     14.00 1354855.00      1.83 319508.56      0.00      0.00      0.00
> 22:50:52         eth0      7.00 1357794.00      0.73 320226.89      0.00      0.00      1.00
> 22:50:53         eth0      5.00 1326130.00      0.63 312784.72      0.00      0.00      0.00
> 22:50:54         eth0      2.00 994584.00      0.12 234598.40      0.00      0.00      1.00
> 22:50:55         eth0      5.00 1318209.00      0.75 310932.46      0.00      0.00      0.00
> 22:50:56         eth0     20.00 1323445.00      1.73 312178.11      0.00      0.00      1.00
> Average:         eth0      9.00 1307380.50      1.00 308356.18      0.00      0.00      0.50
> lpaa23:~# echo 3 >/sys/class/net/eth0/doorbell_opt
> lpaa23:~# sar -n DEV 1 10|grep eth0
> 22:51:03         eth0      4.00 1512055.00      0.54 356599.40      0.00      0.00      0.00
> 22:51:04         eth0      4.00 1507631.00      0.55 355609.46      0.00      0.00      1.00
> 22:51:05         eth0      4.00 1487789.00      0.42 350917.47      0.00      0.00      0.00
> 22:51:06         eth0      7.00 1474460.00      1.22 347811.16      0.00      0.00      1.00
> 22:51:07         eth0      2.00 1496529.00      0.24 352995.18      0.00      0.00      0.00
> 22:51:08         eth0      3.00 1485856.00      0.49 350425.65      0.00      0.00      1.00
> 22:51:09         eth0      1.00 1114808.00      0.06 262905.38      0.00      0.00      0.00
> 22:51:10         eth0      2.00 1510924.00      0.30 356397.53      0.00      0.00      1.00
> 22:51:11         eth0      2.00 1506408.00      0.30 355345.76      0.00      0.00      0.00
> 22:51:12         eth0      2.00 1499122.00      0.32 353668.75      0.00      0.00      1.00
> Average:         eth0      3.10 1459558.20      0.44 344267.57      0.00      0.00      0.50
>
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
>  include/linux/netdevice.h                    |    1
>  net/core/net-sysfs.c                         |   18 +++
>  5 files changed, 83 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6562f78b07f4..fbea83218fc0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>
>         if (polled) {
>                 if (doorbell_pending)
> -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>
>                 mlx4_cq_set_ci(&cq->mcq);
>                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
>         ring->size = size;
>         ring->size_mask = size - 1;
>         ring->sp_stride = stride;
> -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>

Just wondering if this should be a separate change?  Seems like this
is something that might apply outside of your changes since it seems
like it is reserving enough room for 2 buffers.

>         tmp = size * sizeof(struct mlx4_en_tx_info);
>         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
>         ring->sp_cqn = cq;
>         ring->prod = 0;
>         ring->cons = 0xffffffff;
> +       ring->ncons = 0;
>         ring->last_nr_txbb = 1;
>         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
>         memset(ring->buf, 0, ring->buf_size);
> @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
>                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
>  }
>
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>  {
> -       return ring->prod - ring->cons > ring->full_size;
> +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>  }
>

With the use of READ_ONCE are there some barriers that are going to
need to be added to make sure these are populated in the correct
spots?  Any ordering constraints on the updates these fields?

>  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>
>         /* Skip last polled descriptor */
>         ring->cons += ring->last_nr_txbb;
> +       ring->ncons += ring->last_nr_txbb;
>         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
>                  ring->cons, ring->prod);
>
> @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>                                                 !!(ring->cons & ring->size), 0,
>                                                 0 /* Non-NAPI caller */);
>                 ring->cons += ring->last_nr_txbb;
> +               ring->ncons += ring->last_nr_txbb;
>                 cnt++;
>         }
>
> @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>         return cnt;
>  }
>
> +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> +                          struct mlx4_en_tx_ring *ring)
> +{
> +
> +       if (dev->doorbell_opt & 1) {
> +               u32 oval = READ_ONCE(ring->prod_bell);
> +               u32 nval = READ_ONCE(ring->prod);
> +
> +               if (oval == nval)
> +                       return;
> +
> +               /* I can not tell yet if a cmpxchg() is needed or not */
> +               if (dev->doorbell_opt & 2)
> +                       WRITE_ONCE(ring->prod_bell, nval);
> +               else
> +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> +                               return;
> +       }
> +       /* Since there is no iowrite*_native() that writes the
> +        * value as is, without byteswapping - using the one
> +        * the doesn't do byteswapping in the relevant arch
> +        * endianness.
> +        */
> +#if defined(__LITTLE_ENDIAN)
> +       iowrite32(
> +#else
> +       iowrite32be(
> +#endif
> +                 ring->doorbell_qpn,
> +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> +}
> +

I realize you are just copying from the function further down, but
couldn't this just be __raw_writel instead of iowrite32?

Also you may be able to squeeze a bit more out of this by doing some
barrier clean-ups.  In most drivers the wmb() is needed between writes
of the DMA memory and the MMIO memory.  So odds are you could hold off
on using a wmb() until you call this function and you wouldn't really
need it until just before the call to iowrite32 or __raw_writel.  The
rest of it could use either an smp_wmb() or dma_wmb().

>  static bool mlx4_en_process_tx_cq(struct net_device *dev,
>                                   struct mlx4_en_cq *cq, int napi_budget)
>  {
> @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>         wmb();
>
>         /* we want to dirty this cache line once */
> -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> +       ring_cons += txbbs_skipped;
> +       WRITE_ONCE(ring->cons, ring_cons);
> +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> +
> +       if (dev->doorbell_opt)
> +               mlx4_en_xmit_doorbell(dev, ring);
>

It might be more readable to put the addition on ring_cons out in
front of all the WRITE_ONCE macro calls.

>         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
>                 return done < budget;
> @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
>         __iowrite64_copy(dst, src, bytecnt / 8);
>  }
>
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> -{
> -       wmb();
> -       /* Since there is no iowrite*_native() that writes the
> -        * value as is, without byteswapping - using the one
> -        * the doesn't do byteswapping in the relevant arch
> -        * endianness.
> -        */
> -#if defined(__LITTLE_ENDIAN)
> -       iowrite32(
> -#else
> -       iowrite32be(
> -#endif
> -                 ring->doorbell_qpn,
> -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> -}
>
>  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                                   struct mlx4_en_tx_desc *tx_desc,
>                                   union mlx4_wqe_qpn_vlan qpn_vlan,
>                                   int desc_size, int bf_index,
>                                   __be32 op_own, bool bf_ok,
> -                                 bool send_doorbell)
> +                                 bool send_doorbell,
> +                                 const struct net_device *dev, int nr_txbb)
>  {
>         tx_desc->ctrl.qpn_vlan = qpn_vlan;
>
> @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>
>                 wmb();
>
> +               ring->prod += nr_txbb;
>                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
>                              desc_size);
>
> @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                  */
>                 dma_wmb();
>                 tx_desc->ctrl.owner_opcode = op_own;
> +               ring->prod += nr_txbb;
>                 if (send_doorbell)
> -                       mlx4_en_xmit_doorbell(ring);
> +                       mlx4_en_xmit_doorbell(dev, ring);
>                 else
>                         ring->xmit_more++;
>         }
> @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>         }
>
> -       ring->prod += nr_txbb;
> -
>         /* If we used a bounce buffer then copy descriptor back into place */
>         if (unlikely(bounce))
>                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>         }
>         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>
> +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> +        * will happen shortly.
> +        */
> +       if (send_doorbell &&
> +           dev->doorbell_opt &&
> +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)
> +               send_doorbell = false;
> +
>         real_size = (real_size / 16) & 0x3f;
>
>         bf_ok &= desc_size <= MAX_BF && send_doorbell;
> @@ -1043,7 +1076,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>
>         if (unlikely(stop_queue)) {
>                 /* If queue was emptied after the if (stop_queue) , and before
> @@ -1054,7 +1087,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                  */
>                 smp_rmb();
>
> -               ring_cons = ACCESS_ONCE(ring->cons);
>                 if (unlikely(!mlx4_en_is_tx_ring_full(ring))) {
>                         netif_tx_wake_queue(ring->tx_queue);
>                         ring->wake_queue++;
> @@ -1158,8 +1190,6 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>         rx_ring->xdp_tx++;
>         AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
>
> -       ring->prod += nr_txbb;
> -
>         stop_queue = mlx4_en_is_tx_ring_full(ring);
>         send_doorbell = stop_queue ||
>                                 *doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
> @@ -1173,7 +1203,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>         *doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
>
>         return NETDEV_TX_OK;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
>          */
>         u32                     last_nr_txbb;
>         u32                     cons;
> +       u32                     ncons;
>         unsigned long           wake_queue;
>         struct netdev_queue     *tx_queue;
>         u32                     (*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>
>         /* cache line used and dirtied in mlx4_en_xmit() */
>         u32                     prod ____cacheline_aligned_in_smp;
> +       u32                     prod_bell;
>         unsigned int            tx_dropped;
>         unsigned long           bytes;
>         unsigned long           packets;
> @@ -699,7 +701,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                                struct mlx4_en_rx_alloc *frame,
>                                struct net_device *dev, unsigned int length,
>                                int tx_ind, int *doorbell_pending);
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
> +void mlx4_en_xmit_doorbell(const struct net_device *dev, struct mlx4_en_tx_ring *ring);
>  bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
>                         struct mlx4_en_rx_alloc *frame);
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4ffcd874cc20..39565b5425a6 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1816,6 +1816,7 @@ struct net_device {
>         DECLARE_HASHTABLE       (qdisc_hash, 4);
>  #endif
>         unsigned long           tx_queue_len;
> +       unsigned long           doorbell_opt;
>         spinlock_t              tx_global_lock;
>         int                     watchdog_timeo;
>
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index b0c04cf4851d..df05f81f5150 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -367,6 +367,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
>  }
>  NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>
> +static int change_doorbell_opt(struct net_device *dev, unsigned long val)
> +{
> +       dev->doorbell_opt = val;
> +       return 0;
> +}
> +
> +static ssize_t doorbell_opt_store(struct device *dev,
> +                                 struct device_attribute *attr,
> +                                 const char *buf, size_t len)
> +{
> +       if (!capable(CAP_NET_ADMIN))
> +               return -EPERM;
> +
> +       return netdev_store(dev, attr, buf, len, change_doorbell_opt);
> +}
> +NETDEVICE_SHOW_RW(doorbell_opt, fmt_ulong);
> +
>  static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
>                              const char *buf, size_t len)
>  {
> @@ -531,6 +548,7 @@ static struct attribute *net_class_attrs[] = {
>         &dev_attr_phys_port_name.attr,
>         &dev_attr_phys_switch_id.attr,
>         &dev_attr_proto_down.attr,
> +       &dev_attr_doorbell_opt.attr,
>         NULL,
>  };
>  ATTRIBUTE_GROUPS(net_class);
>
>

^ permalink raw reply

* Re: [PATCH iproute2] ip: update link types to show 6lowpan and ieee802.15.4 monitor
From: Stefan Schmidt @ 2016-12-01 21:31 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev@vger.kernel.org, linux-wpan@vger.kernel.org
In-Reply-To: <1477647723-14641-1-git-send-email-stefan@datenfreihafen.org>

Hello.

On 28.10.2016 11:42, Stefan Schmidt wrote:
> Both types have been missing here and thus ip always showed
> only the numbers.
> 
> Based on a suggestion from Alexander Aring.
> 
> Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>

Did you somehow mangle this patch manually?

Looking at the patch in your git repo it shows no author name but
instead just my patch was send with git format-patch and git send-email
as usual and shows the right author. Was there something worn on my side
or yours? Just checking to avoid it in the future.

http://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=8ae2c5382bd9d98a8f7ddcb1faad1a978d773909

regards
Stefan Schmidt

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Hannes Frederic Sowa @ 2016-12-01 21:27 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Thomas Graf, Florian Westphal, Linux Kernel Network Developers
In-Reply-To: <CALx6S36=Y0dSux+-omWXZKtxb73_g4+DXhPBawb79+UF6rp9rw@mail.gmail.com>

On 01.12.2016 22:12, Tom Herbert wrote:
> On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Hello,
>>
>> this is a good conversation and I simply want to bring my worries
>> across. I don't have good solutions for the problems XDP tries to solve
>> but I fear we could get caught up in maintenance problems in the long
>> term given the ideas floating around on how to evolve XDP currently.
>>
>> On 01.12.2016 17:28, Thomas Graf wrote:
>>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>>>> XDP manipulates packets at free will and thus all security guarantees
>>>> are off as well as in any user space solution.
>>>>
>>>> Secondly user space provides policy, acl, more controlled memory
>>>> protection, restartability and better debugability. If I had multi
>>>> tenant workloads I would definitely put more complex "business/acl"
>>>> logic into user space, so I can make use of LSM and other features to
>>>> especially prevent a network facing service to attack the tenants. If
>>>> stuff gets put into the kernel you run user controlled code in the
>>>> kernel exposing a much bigger attack vector.
>>>>
>>>> What use case do you see in XDP specifically e.g. for container networking?
>>>
>>> DDOS mitigation to protect distributed applications in large clusters.
>>> Relying on CDN works to protect API gateways and frontends (as long as
>>> they don't throw you out of their network) but offers no protection
>>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>>> level and allowing the mitigation capability to scale up with the number
>>> of servers is natural and cheap.
>>
>> So far we e.g. always considered L2 attacks a problem of the network
>> admin to correctly protect the environment. Are you talking about
>> protecting the L3 data plane? Are there custom proprietary protocols in
>> place which need custom protocol parsers that need involvement of the
>> kernel before it could verify the packet?
>>
>> In the past we tried to protect the L3 data plane as good as we can in
>> Linux to allow the plain old server admin to set an IP address on an
>> interface and install whatever software in user space. We try not only
>> to protect it but also try to achieve fairness by adding a lot of
>> counters everywhere. Are protections missing right now or are we talking
>> about better performance?
>>
> The technical plenary at last IETF on Seoul a couple of weeks ago was
> exclusively focussed on DDOS in light of the recent attack against
> Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
> presentation by Nick Sullivan
> (https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
> alluded to some implementation of DDOS mitigation. In particular, on
> slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
> numbers he gave we're based in iptables+BPF and that was a whole
> 1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
> and that's also when I introduced XDP to whole IETF :-) ). If that's
> the best we can do the Internet is in a world hurt. DDOS mitigation
> alone is probably a sufficient motivation to look at XDP. We need
> something that drops bad packets as quickly as possible when under
> attack, we need this to be integrated into the stack, we need it to be
> programmable to deal with the increasing savvy of attackers, and we
> don't want to be forced to be dependent on HW solutions. This is why
> we created XDP!

I totally understand that. But in my reply to David in this thread I
mentioned DNS apex processing as being problematic which is actually
being referred in your linked slide deck on page 9 ("What do floods look
like") and the problematic of parsing DNS packets in XDP due to string
processing and looping inside eBPF.

Not to mention the fact that you might have to deal with fragments in
the Internet. Some DOS mitigations were already abused to generate
blackholes for other users. Filtering such stuff is quite complicated.

I argued also under the aspect of what Thomas said, that the outside
world of the cluster is already protected by a CDN.

Bye,
Hannes

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Ido Schimmel @ 2016-12-01 21:21 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Miller, jiri, netdev, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <343eadfa-f872-788d-748c-a195c0c4d03a@stressinduktion.org>

On Thu, Dec 01, 2016 at 10:09:19PM +0100, Hannes Frederic Sowa wrote:
> On 01.12.2016 21:54, Ido Schimmel wrote:
> > On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote:
> >> On 01.12.2016 21:04, David Miller wrote:
> >>>
> >>> Hannes and Ido,
> >>>
> >>> It looks like we are very close to having this in mergable shape, can
> >>> you guys work out this final issue and figure out if it really is
> >>> a merge stopped or not?
> >>
> >> Sure, if the fib notification register could be done under protection of
> >> the sequence counter I don't see any more problems.
> > 
> > Did you maybe miss my reply yesterday? Because I was trying to
> > understand what "ordering" you're referring to, but didn't receive a
> > reply from you.
> 
> Oh, strange, I am pretty sure I replied to that. Let me resend it.

:)

I did get this reply, and then replied myself here:
https://marc.info/?l=linux-netdev&m=148053017425465&w=2

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 21:09 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: David Miller, jiri, netdev, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161201205457.xf5evjphjj6mytkf@splinter>

On 01.12.2016 21:54, Ido Schimmel wrote:
> On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote:
>> On 01.12.2016 21:04, David Miller wrote:
>>>
>>> Hannes and Ido,
>>>
>>> It looks like we are very close to having this in mergable shape, can
>>> you guys work out this final issue and figure out if it really is
>>> a merge stopped or not?
>>
>> Sure, if the fib notification register could be done under protection of
>> the sequence counter I don't see any more problems.
> 
> Did you maybe miss my reply yesterday? Because I was trying to
> understand what "ordering" you're referring to, but didn't receive a
> reply from you.

Oh, strange, I am pretty sure I replied to that. Let me resend it.

>> The sync handler is nice to have and can be done in a later patch series.
> 
> Sync handler?

I was talking about SWITCHDEV_SYNC.

Bye,
Hannes

^ permalink raw reply

* Re: [patch] net: renesas: ravb: unintialized return value
From: Sergei Shtylyov @ 2016-12-01 21:13 UTC (permalink / raw)
  To: Dan Carpenter, Johan Hovold
  Cc: David S. Miller, Yoshihiro Kaneko, Kazuya Mizuguchi, Simon Horman,
	Wolfram Sang, Andrew Lunn, Philippe Reynes, Niklas Söderlund,
	Arnd Bergmann, netdev, linux-renesas-soc, kernel-janitors
In-Reply-To: <20161201205744.GB10701@mwanda>

Hello!

On 12/01/2016 11:57 PM, Dan Carpenter wrote:

> We want to set the other "err" variable here so that we can return it
> later.  My version of GCC misses this issue but I caught it with a
> static checker.
>
> Fixes: 9f70eb339f52 ("net: ethernet: renesas: ravb: fix fixed-link phydev leaks")

    Hm, I somehow missed this one, probably due to the horrific CC list. :-(

> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Acked-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>

MBR, Sergei

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Tom Herbert @ 2016-12-01 21:12 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Thomas Graf, Florian Westphal, Linux Kernel Network Developers
In-Reply-To: <583b8947-3395-8529-933b-08e1a86a0778@stressinduktion.org>

On Thu, Dec 1, 2016 at 12:44 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hello,
>
> this is a good conversation and I simply want to bring my worries
> across. I don't have good solutions for the problems XDP tries to solve
> but I fear we could get caught up in maintenance problems in the long
> term given the ideas floating around on how to evolve XDP currently.
>
> On 01.12.2016 17:28, Thomas Graf wrote:
>> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>>> XDP manipulates packets at free will and thus all security guarantees
>>> are off as well as in any user space solution.
>>>
>>> Secondly user space provides policy, acl, more controlled memory
>>> protection, restartability and better debugability. If I had multi
>>> tenant workloads I would definitely put more complex "business/acl"
>>> logic into user space, so I can make use of LSM and other features to
>>> especially prevent a network facing service to attack the tenants. If
>>> stuff gets put into the kernel you run user controlled code in the
>>> kernel exposing a much bigger attack vector.
>>>
>>> What use case do you see in XDP specifically e.g. for container networking?
>>
>> DDOS mitigation to protect distributed applications in large clusters.
>> Relying on CDN works to protect API gateways and frontends (as long as
>> they don't throw you out of their network) but offers no protection
>> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
>> level and allowing the mitigation capability to scale up with the number
>> of servers is natural and cheap.
>
> So far we e.g. always considered L2 attacks a problem of the network
> admin to correctly protect the environment. Are you talking about
> protecting the L3 data plane? Are there custom proprietary protocols in
> place which need custom protocol parsers that need involvement of the
> kernel before it could verify the packet?
>
> In the past we tried to protect the L3 data plane as good as we can in
> Linux to allow the plain old server admin to set an IP address on an
> interface and install whatever software in user space. We try not only
> to protect it but also try to achieve fairness by adding a lot of
> counters everywhere. Are protections missing right now or are we talking
> about better performance?
>
The technical plenary at last IETF on Seoul a couple of weeks ago was
exclusively focussed on DDOS in light of the recent attack against
Dyn. There were speakers form Cloudflare and Dyn. The Cloudflare
presentation by Nick Sullivan
(https://www.ietf.org/proceedings/97/slides/slides-97-ietf-sessb-how-to-stay-online-harsh-realities-of-operating-in-a-hostile-network-nick-sullivan-01.pdf)
alluded to some implementation of DDOS mitigation. In particular, on
slide 6 Nick gave some numbers for drop rates in DDOS. The "kernel"
numbers he gave we're based in iptables+BPF and that was a whole
1.2Mpps-- somehow that seems ridiculously to me (I said so at the mic
and that's also when I introduced XDP to whole IETF :-) ). If that's
the best we can do the Internet is in a world hurt. DDOS mitigation
alone is probably a sufficient motivation to look at XDP. We need
something that drops bad packets as quickly as possible when under
attack, we need this to be integrated into the stack, we need it to be
programmable to deal with the increasing savvy of attackers, and we
don't want to be forced to be dependent on HW solutions. This is why
we created XDP!

Tom

> To provide fairness you often have to share validated data within the
> kernel and with XDP. This requires consistent lookup methods for sockets
> in the lower level. Those can be exported to XDP via external functions
> and become part of uAPI which will limit our ability to change those
> functions in future. When the discussion started about early demuxing in
> XDP I became really nervous, because suddenly the XDP program has to
> decide correctly which protocol type it has and look in the correct
> socket table for the socket. Different semantics for sockets can apply
> here, e.g. some sockets are RCU managed, some end up using reference
> counts. A wrong decision here would cause havoc in the kernel (XDP
> considers packet as UDP but kernel stack as TCP). Also, who knows that
> we won't have per-cpu socket tables we would keep that as uAPI (this is
> btw. the dragonflyBSD approach to scaling)? Imagine someone writing a
> SIP rewriter in XDP and depending on a coherent view of all sockets even
> if their hash doesn't fit to the one of the queue? Suddenly something
> which was thought of as being only mutable by one CPU becomes global
> again and because of XDP we need to add locking because of uAPI.
>
> This discussion is parallel to the discussion about trace points, which
> are not considered uAPI. If eBPF functions are not considered uAPI then
> eBPF in the network stack will have much less value, because you
> suddenly depend on specific kernel versions again and cannot simply load
> the code into the kernel. The API checks will become very difficult to
> implement, see also the ongoing MODVERSIONS discussions on LKML some
> days back.
>
>>>> I agree with you if the LB is a software based appliance in either a
>>>> dedicated VM or on dedicated baremetal.
>>>>
>>>> The reality is turning out to be different in many cases though, LB
>>>> needs to be performed not only for north south but east west as well.
>>>> So even if I would handle LB for traffic entering my datacenter in user
>>>> space, I will need the same LB for packets from my applications and
>>>> I definitely don't want to move all of that into user space.
>>>
>>> The open question to me is why is programmability needed here.
>>>
>>> Look at the discussion about ECMP and consistent hashing. It is not very
>>> easy to actually write this code correctly. Why can't we just put C code
>>> into the kernel that implements this once and for all and let user space
>>> update the policies?
>>
>> Whatever LB logic is put in place with native C code now is unlikely the
>> logic we need in two years. We can't really predict the future. If it
>> was the case, networking would have been done long ago and we would all
>> be working on self eating ice cream now.
>
> Did LB algorithms on the networking layer change that much?
>
> There is a long history of using consistent hashing for load balancing,
> as e.g. is done in haproxy or F5.
>
>>> Load balancers have to deal correctly with ICMP packets, e.g. they even
>>> have to be duplicated to every ECMP route. This seems to be problematic
>>> to do in eBPF programs due to looping constructs so you end up with
>>> complicated user space anyway.
>>
>> Feel free to implement such complex LBs in user space or natively. It is
>> not required for the majority of use cases. The most popular LBs for
>> application load balancing have no idea of ECMP and require ECMP aware
>> routers to be made redundant itself.
>
> They are already available and e.g. deployed as part of some kubernetes
> stacks as I wrote above.
>
> It is a generally available algorithm which fits a lot of use cases,
> basically every website that wants to shard its sessions can make use of
> it. Also it is independent of ECMP and mostly is implemented in load
> balancers due to its need for a lot of memory.
>
> New algorithms outdate old ones but the core principles will be the same
> and don't require major changes to the interface, e.g. ipvs scheduler.
>
> If we are talking about security features for early drop inside TCP
> streams, like http, you need to have a proper stream reassembly engine.
> Snort e.g. dropped a complete stream of TCP packets if you send a RST
> with the same quadruple but a wrong sequence number. End system didn't
> consider the RST but non synchronized solutions ended up not inspecting
> this flow anymore. How do you handle diverting views on meta data in
> networking protocols? Also look how hard it is to keep e.g. the fib
> table synchronized to the hardware.
>
> In retrospect, I think Tom Herbert's move putting ILA stateless
> translation into the XDP hook wasn't that bad after all. ILA maybe
> hopefully becomes a standard and its implementation is already in the
> kernel so why keep its translator not part of the kernel, too?
>
> TLDR; what I'm trying to argue is that evolution of the network stack is
> problematic with a programmable backplane in the kernel which locks out
> future modifications of the stack in some places. On the other side, if
> we don't add those features we will have a half baked solution and
> people will simply prefer netmap or DPDK.
>
> Bye,
> Hannes
>

^ permalink raw reply

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 21:09 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jiri Pirko, netdev, davem, idosch, eladr, yotamg, nogahf, arkadis,
	ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
	f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130163229.rkxvuwukgg35ktrx@splinter.mtl.com>

On 30.11.2016 17:32, Ido Schimmel wrote:
> Hi Hannes,
> 
> On Wed, Nov 30, 2016 at 04:37:48PM +0100, Hannes Frederic Sowa wrote:
>> On 30.11.2016 11:09, Jiri Pirko wrote:
>>> From: Ido Schimmel <idosch@mellanox.com>
>>>
>>> Make sure the device has a complete view of the FIB tables by invoking
>>> their dump during module init.
>>>
>>> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
>>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>>> ---
>>>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
>>>  1 file changed, 23 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> index 14bed1d..d176047 100644
>>> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
>>> @@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
>>>  	return NOTIFY_DONE;
>>>  }
>>>  
>>> +static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
>>> +{
>>> +	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
>>> +
>>> +	/* Flush pending FIB notifications and then flush the device's
>>> +	 * table before requesting another dump. Do that with RTNL held,
>>> +	 * as FIB notification block is already registered.
>>> +	 */
>>> +	mlxsw_core_flush_owq();
>>> +	rtnl_lock();
>>> +	mlxsw_sp_router_fib_flush(mlxsw_sp);
>>> +	rtnl_unlock();
>>> +}
>>> +
>>>  int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>  {
>>> +	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
>>>  	int err;
>>>  
>>>  	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
>>> @@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
>>>  
>>>  	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
>>>  	register_fib_notifier(&mlxsw_sp->fib_nb);
>>
>> Sorry to pick in here again:
>>
>> There is a race here. You need to protect the registration of the fib
>> notifier as well by the sequence counter. Updates here are not ordered
>> in relation to this code below.
> 
> You mean updates that can be received after you registered the notifier
> and until the dump started? I'm aware of that and that's OK. This
> listener should be able to handle duplicates.

I am not concerned about duplicates, but about ordering deletes and
getting an add from the RCU code you will add the node to hw while it is
deleted in the software path. You probably will ignore the delete
because nothing is installed in hw and later add the node which was
actually deleted but just reordered which happend on another CPU, no?

> I've a follow up patchset that introduces a new event in switchdev
> notification chain called SWITCHDEV_SYNC, which is sent when port
> netdevs are enslaved / released  from a master device (points in time
> where kernel<->device can get out of sync). It will invoke
> re-propagation of configuration from different parts of the stack
> (e.g. bridge driver, 8021q driver, fib/neigh code), which can result
> in duplicates.

Okay, understood. I wonder how we can protect against accidentally abort
calls actually. E.g. if I start to inject routes into my routing domain
how can I make sure the box doesn't die after I try to insert enough
routes. Do we need to touch quagga etc?

Thanks,
Hannes

^ permalink raw reply

* [patch] net: renesas: ravb: unintialized return value
From: Dan Carpenter @ 2016-12-01 20:57 UTC (permalink / raw)
  To: Sergei Shtylyov, Johan Hovold
  Cc: David S. Miller, Yoshihiro Kaneko, Kazuya Mizuguchi, Simon Horman,
	Wolfram Sang, Andrew Lunn, Philippe Reynes, Niklas Söderlund,
	Arnd Bergmann, netdev, linux-renesas-soc, kernel-janitors

We want to set the other "err" variable here so that we can return it
later.  My version of GCC misses this issue but I caught it with a
static checker.

Fixes: 9f70eb339f52 ("net: ethernet: renesas: ravb: fix fixed-link phydev leaks")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
Applies to the net tree for 4.10.

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index 2c0357c..92d7692 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c
@@ -1016,8 +1016,6 @@ static int ravb_phy_init(struct net_device *ndev)
 	 * at this time.
 	 */
 	if (priv->chip_id == RCAR_GEN3) {
-		int err;
-
 		err = phy_set_max_speed(phydev, SPEED_100);
 		if (err) {
 			netdev_err(ndev, "failed to limit PHY to 100Mbit/s\n");

^ permalink raw reply related

* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Ido Schimmel @ 2016-12-01 20:54 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David Miller, jiri, netdev, idosch, eladr, yotamg, nogahf,
	arkadis, ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot,
	andrew, f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <d70996c1-be4e-136b-6325-1a8c152e44ce@stressinduktion.org>

On Thu, Dec 01, 2016 at 09:40:48PM +0100, Hannes Frederic Sowa wrote:
> On 01.12.2016 21:04, David Miller wrote:
> > 
> > Hannes and Ido,
> > 
> > It looks like we are very close to having this in mergable shape, can
> > you guys work out this final issue and figure out if it really is
> > a merge stopped or not?
> 
> Sure, if the fib notification register could be done under protection of
> the sequence counter I don't see any more problems.

Did you maybe miss my reply yesterday? Because I was trying to
understand what "ordering" you're referring to, but didn't receive a
reply from you.

> The sync handler is nice to have and can be done in a later patch series.

Sync handler?

^ permalink raw reply

* Re: [PATCH net] tcp: warn on bogus MSS and try to amend it
From: marcelo.leitner @ 2016-12-01 20:46 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, jmaxwell37, alexandre.sidorenko, kuznet, jmorris,
	yoshfuji, kaber, tlfalcon, brking, eric.dumazet
In-Reply-To: <20161201.152949.1953888486413180001.davem@davemloft.net>

On Thu, Dec 01, 2016 at 03:29:49PM -0500, David Miller wrote:
> From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> Date: Wed, 30 Nov 2016 11:14:32 -0200
> 
> > There have been some reports lately about TCP connection stalls caused
> > by NIC drivers that aren't setting gso_size on aggregated packets on rx
> > path. This causes TCP to assume that the MSS is actually the size of the
> > aggregated packet, which is invalid.
> > 
> > Although the proper fix is to be done at each driver, it's often hard
> > and cumbersome for one to debug, come to such root cause and report/fix
> > it.
> > 
> > This patch amends this situation in two ways. First, it adds a warning
> > on when this situation occurs, so it gives a hint to those trying to
> > debug this. It also limit the maximum probed MSS to the adverised MSS,
> > as it should never be any higher than that.
> > 
> > The result is that the connection may not have the best performance ever
> > but it shouldn't stall, and the admin will have a hint on what to look
> > for.
> > 
> > Tested with virtio by forcing gso_size to 0.
> > 
> > Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
> > Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> 
> I totally agree with this change, however I think the warning message can
> be improved in two ways:
> 
> >  	len = skb_shinfo(skb)->gso_size ? : skb->len;
> >  	if (len >= icsk->icsk_ack.rcv_mss) {
> > -		icsk->icsk_ack.rcv_mss = len;
> > +		icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> > +					       tcp_sk(sk)->advmss);
> > +		if (icsk->icsk_ack.rcv_mss != len)
> > +			pr_warn_once("Seems your NIC driver is doing bad RX acceleration. TCP performance may be compromised.\n");
> 
> We know it's a bad GRO implementation that causes this so let's be specific in the
> message, perhaps something like:
> 
> 	Driver has suspect GRO implementation, TCP performance may be compromised.

Okay.

> 
> Also, we have skb->dev available here most likely, so prefixing the message with
> skb->dev->name would make analyzing this situation even easier for someone hitting
> this.

Nice, yes.
And this skb is mostly non-forwardable as it's bigger than the MTU,
so if someone is using net namespaces and this skb would be routed
through some veth interfaces, it would give a false hint then, but
shouldn't happen. Unless it would fit (a larger) veth mtu, but still,
one probably will simplify things up to debug this.

> 
> I'm not certain if an skb->dev==NULL check is necessary here or not, but it is
> definitely something you need to consider.
> 
> Thanks!
> 

Will check. Thanks!

  Marcelo

^ permalink raw reply

* Re: [flamebait] xdp, well meaning but pointless
From: Hannes Frederic Sowa @ 2016-12-01 20:44 UTC (permalink / raw)
  To: Thomas Graf; +Cc: Florian Westphal, netdev
In-Reply-To: <20161201162814.GA31300@pox.localdomain>

Hello,

this is a good conversation and I simply want to bring my worries
across. I don't have good solutions for the problems XDP tries to solve
but I fear we could get caught up in maintenance problems in the long
term given the ideas floating around on how to evolve XDP currently.

On 01.12.2016 17:28, Thomas Graf wrote:
> On 12/01/16 at 04:52pm, Hannes Frederic Sowa wrote:
>> First of all, this is a rant targeted at XDP and not at eBPF as a whole.
>> XDP manipulates packets at free will and thus all security guarantees
>> are off as well as in any user space solution.
>>
>> Secondly user space provides policy, acl, more controlled memory
>> protection, restartability and better debugability. If I had multi
>> tenant workloads I would definitely put more complex "business/acl"
>> logic into user space, so I can make use of LSM and other features to
>> especially prevent a network facing service to attack the tenants. If
>> stuff gets put into the kernel you run user controlled code in the
>> kernel exposing a much bigger attack vector.
>>
>> What use case do you see in XDP specifically e.g. for container networking?
> 
> DDOS mitigation to protect distributed applications in large clusters.
> Relying on CDN works to protect API gateways and frontends (as long as
> they don't throw you out of their network) but offers no protection
> beyond that, e.g. a noisy/hostile neighbour. Doing this at the server
> level and allowing the mitigation capability to scale up with the number
> of servers is natural and cheap.

So far we e.g. always considered L2 attacks a problem of the network
admin to correctly protect the environment. Are you talking about
protecting the L3 data plane? Are there custom proprietary protocols in
place which need custom protocol parsers that need involvement of the
kernel before it could verify the packet?

In the past we tried to protect the L3 data plane as good as we can in
Linux to allow the plain old server admin to set an IP address on an
interface and install whatever software in user space. We try not only
to protect it but also try to achieve fairness by adding a lot of
counters everywhere. Are protections missing right now or are we talking
about better performance?

To provide fairness you often have to share validated data within the
kernel and with XDP. This requires consistent lookup methods for sockets
in the lower level. Those can be exported to XDP via external functions
and become part of uAPI which will limit our ability to change those
functions in future. When the discussion started about early demuxing in
XDP I became really nervous, because suddenly the XDP program has to
decide correctly which protocol type it has and look in the correct
socket table for the socket. Different semantics for sockets can apply
here, e.g. some sockets are RCU managed, some end up using reference
counts. A wrong decision here would cause havoc in the kernel (XDP
considers packet as UDP but kernel stack as TCP). Also, who knows that
we won't have per-cpu socket tables we would keep that as uAPI (this is
btw. the dragonflyBSD approach to scaling)? Imagine someone writing a
SIP rewriter in XDP and depending on a coherent view of all sockets even
if their hash doesn't fit to the one of the queue? Suddenly something
which was thought of as being only mutable by one CPU becomes global
again and because of XDP we need to add locking because of uAPI.

This discussion is parallel to the discussion about trace points, which
are not considered uAPI. If eBPF functions are not considered uAPI then
eBPF in the network stack will have much less value, because you
suddenly depend on specific kernel versions again and cannot simply load
the code into the kernel. The API checks will become very difficult to
implement, see also the ongoing MODVERSIONS discussions on LKML some
days back.

>>> I agree with you if the LB is a software based appliance in either a
>>> dedicated VM or on dedicated baremetal.
>>>
>>> The reality is turning out to be different in many cases though, LB
>>> needs to be performed not only for north south but east west as well.
>>> So even if I would handle LB for traffic entering my datacenter in user
>>> space, I will need the same LB for packets from my applications and
>>> I definitely don't want to move all of that into user space.
>>
>> The open question to me is why is programmability needed here.
>>
>> Look at the discussion about ECMP and consistent hashing. It is not very
>> easy to actually write this code correctly. Why can't we just put C code
>> into the kernel that implements this once and for all and let user space
>> update the policies?
> 
> Whatever LB logic is put in place with native C code now is unlikely the
> logic we need in two years. We can't really predict the future. If it
> was the case, networking would have been done long ago and we would all
> be working on self eating ice cream now.

Did LB algorithms on the networking layer change that much?

There is a long history of using consistent hashing for load balancing,
as e.g. is done in haproxy or F5.

>> Load balancers have to deal correctly with ICMP packets, e.g. they even
>> have to be duplicated to every ECMP route. This seems to be problematic
>> to do in eBPF programs due to looping constructs so you end up with
>> complicated user space anyway.
> 
> Feel free to implement such complex LBs in user space or natively. It is
> not required for the majority of use cases. The most popular LBs for
> application load balancing have no idea of ECMP and require ECMP aware
> routers to be made redundant itself.

They are already available and e.g. deployed as part of some kubernetes
stacks as I wrote above.

It is a generally available algorithm which fits a lot of use cases,
basically every website that wants to shard its sessions can make use of
it. Also it is independent of ECMP and mostly is implemented in load
balancers due to its need for a lot of memory.

New algorithms outdate old ones but the core principles will be the same
and don't require major changes to the interface, e.g. ipvs scheduler.

If we are talking about security features for early drop inside TCP
streams, like http, you need to have a proper stream reassembly engine.
Snort e.g. dropped a complete stream of TCP packets if you send a RST
with the same quadruple but a wrong sequence number. End system didn't
consider the RST but non synchronized solutions ended up not inspecting
this flow anymore. How do you handle diverting views on meta data in
networking protocols? Also look how hard it is to keep e.g. the fib
table synchronized to the hardware.

In retrospect, I think Tom Herbert's move putting ILA stateless
translation into the XDP hook wasn't that bad after all. ILA maybe
hopefully becomes a standard and its implementation is already in the
kernel so why keep its translator not part of the kernel, too?

TLDR; what I'm trying to argue is that evolution of the network stack is
problematic with a programmable backplane in the kernel which locks out
future modifications of the stack in some places. On the other side, if
we don't add those features we will have a half baked solution and
people will simply prefer netmap or DPDK.

Bye,
Hannes

^ permalink raw reply

* Re: [net PATCH 0/2] Don't use lco_csum to compute IPv4 checksum
From: David Miller @ 2016-12-01 20:41 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: alexander.h.duyck, netdev, intel-wired-lan, sfr
In-Reply-To: <1480540522.2377.18.camel@intel.com>

From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 30 Nov 2016 13:15:22 -0800

> On Wed, 2016-11-30 at 09:47 -0500, David Miller wrote:
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>> Date: Mon, 28 Nov 2016 10:42:18 -0500
>> 
>> > When I implemented the GSO partial support in the Intel drivers I was
>> using
>> > lco_csum to compute the checksum that we needed to plug into the IPv4
>> > checksum field in order to cancel out the data that was not a part of
>> the
>> > IPv4 header.  However this didn't take into account that the transport
>> > offset might be pointing to the inner transport header.
>> > 
>> > Instead of using lco_csum I have just coded around it so that we can
>> use
>> > the outer IP header plus the IP header length to determine where we
>> need to
>> > start our checksum and then just call csum_partial ourselves.
>> > 
>> > This should fix the SIT issue reported on igb interfaces as well as
>> simliar
>> > issues that would pop up on other Intel NICs.
>> 
>> Jeff, are you going to send me a pull request with this stuff or would
>> you be OK with my applying these directly to 'net'?
> 
> Go ahead and apply those to your net tree, I do not want to hold this up.

Ok, done, thanks Jeff.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox