* Re: [PATCH v2] tun: Use netif_receive_skb instead of netif_rx
From: David Miller @ 2016-12-01 19:43 UTC (permalink / raw)
To: andreyknvl
Cc: herbert, jasowang, edumazet, pmk, pabeni, mst, soheil, elfring,
rppt, netdev, linux-kernel, dvyukov, kcc, syzkaller
In-Reply-To: <1480584880-48651-1-git-send-email-andreyknvl@google.com>
From: Andrey Konovalov <andreyknvl@google.com>
Date: Thu, 1 Dec 2016 10:34:40 +0100
> This patch changes tun.c to call netif_receive_skb instead of netif_rx
> when a packet is received (if CONFIG_4KSTACKS is not enabled to avoid
> stack exhaustion). The difference between the two is that netif_rx queues
> the packet into the backlog, and netif_receive_skb processes the packet
> in the current context.
>
> This patch is required for syzkaller [1] to collect coverage from packet
> receive paths, when a packet is being received through tun (syzkaller collects
> coverage per process in the process context).
>
> As mentioned by Eric this change also speeds up tun/tap. As measured by
> Peter it speeds up his closed-loop single-stream tap/OVS benchmark by
> about 23%, from 700k packets/second to 867k packets/second.
>
> A similar patch was introduced back in 2010 [2, 3], but the author found
> out that the patch doesn't help with the task he had in mind (for cgroups
> to shape network traffic based on the original process) and decided not to
> go further with it. The main concern back then was about possible stack
> exhaustion with 4K stacks.
>
> [1] https://github.com/google/syzkaller
>
> [2] https://www.spinics.net/lists/netdev/thrd440.html#130570
>
> [3] https://www.spinics.net/lists/netdev/msg130570.html
>
> Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
> ---
>
> Changes since v1:
> - incorporate Eric's note about speed improvements in commit description
> - use netif_receive_skb only if CONFIG_4KSTACKS is not enabled
Applied to net-next, thanks!
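The queue-versus-inline difference described in the commit message can be modelled in userspace. What follows is a toy sketch, not kernel code: `struct toy_cpu`, `toy_netif_rx`, `toy_netif_receive_skb` and `toy_process_backlog` are hypothetical names, and only the behaviour (queue into a backlog versus process in the caller's context) is taken from the description above.

```c
#include <stddef.h>

#define TOY_BACKLOG_MAX 8

/* Toy per-CPU state: a backlog queue plus a count of packets that have
 * actually been run through the (pretend) protocol stack. */
struct toy_cpu {
	int backlog[TOY_BACKLOG_MAX];
	size_t backlog_len;
	size_t processed;
};

/* Pretend protocol processing: here it only counts the packet. */
static void toy_stack_input(struct toy_cpu *cpu, int pkt)
{
	(void)pkt;
	cpu->processed++;
}

/* Models netif_rx(): the packet is only queued; processing happens later,
 * in a different context. Returns 0 on success, -1 if the backlog is full. */
static int toy_netif_rx(struct toy_cpu *cpu, int pkt)
{
	if (cpu->backlog_len == TOY_BACKLOG_MAX)
		return -1;
	cpu->backlog[cpu->backlog_len++] = pkt;
	return 0;
}

/* Models netif_receive_skb(): the packet is processed immediately, in the
 * caller's context (and therefore on the caller's stack). */
static int toy_netif_receive_skb(struct toy_cpu *cpu, int pkt)
{
	toy_stack_input(cpu, pkt);
	return 0;
}

/* Models the later pass that drains the backlog. */
static void toy_process_backlog(struct toy_cpu *cpu)
{
	for (size_t i = 0; i < cpu->backlog_len; i++)
		toy_stack_input(cpu, cpu->backlog[i]);
	cpu->backlog_len = 0;
}
```

In this model a writer that calls `toy_netif_receive_skb` sees the packet processed before the call returns, which is the property the commit message relies on for per-process coverage; with `toy_netif_rx` nothing is processed until `toy_process_backlog` runs.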
* Re: Initial thoughts on TXDP
From: Rick Jones @ 2016-12-01 19:48 UTC (permalink / raw)
To: Tom Herbert, Sowmini Varadhan; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S35DCyi_2z1pqCLaB1bVyNykP_J3YaYEXUT8xxmuzyBDwA@mail.gmail.com>
On 12/01/2016 11:05 AM, Tom Herbert wrote:
> For the GSO and GRO the rationale is that performing the extra SW
> processing to do the offloads is significantly less expensive than
> running each packet through the full stack. This is true in a
> multi-layered generalized stack. In TXDP, however, we should be able
> to optimize the stack data path such that that would no longer be
> true. For instance, if we can process the packets received on a
> connection quickly enough so that it's about the same or just a little
> more costly than GRO processing then we might bypass GRO entirely.
> TSO is probably still relevant in TXDP since it reduces overheads
> processing TX in the device itself.
Just how much per-packet path-length are you thinking will go away under
the likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO
does some non-trivial things to effective overhead (service demand) and
so throughput:
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send   Send                         Utilization      Service Demand
Socket Socket Message Elapsed              Send    Recv     Send    Recv
Size   Size   Size    Time     Throughput  local   remote   local   remote
bytes  bytes  bytes   secs.    10^6bits/s  % S     % U      us/KB   us/KB

87380  16384  16384   10.00    9260.24     2.02    -1.00    0.428   -1.000
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send   Send                         Utilization      Service Demand
Socket Socket Message Elapsed              Send    Recv     Send    Recv
Size   Size   Size    Time     Throughput  local   remote   local   remote
bytes  bytes  bytes   secs.    10^6bits/s  % S     % U      us/KB   us/KB

87380  16384  16384   10.00    5621.82     4.25    -1.00    1.486   -1.000
And that is still with the stretch-ACKs induced by GRO at the receiver.
Losing GRO has quite similar results:
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send   Send                         Utilization      Service Demand
Socket Socket Message Elapsed              Recv    Send     Recv    Send
Size   Size   Size    Time     Throughput  local   remote   local   remote
bytes  bytes  bytes   secs.    10^6bits/s  % S     % U      us/KB   us/KB

87380  16384  16384   10.00    9154.02     4.00    -1.00    0.860   -1.000
stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
Recv   Send   Send                         Utilization      Service Demand
Socket Socket Message Elapsed              Recv    Send     Recv    Send
Size   Size   Size    Time     Throughput  local   remote   local   remote
bytes  bytes  bytes   secs.    10^6bits/s  % S     % U      us/KB   us/KB

87380  16384  16384   10.00    4212.06     5.36    -1.00    2.502   -1.000
I'm sure there is a very non-trivial "it depends" component here -
netperf will get the peak benefit from *SO and so one will see the peak
difference in service demands - but even if one gets only 6 segments per
*SO that is a lot of path-length to make-up.
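To put numbers on the runs above, the local service demand and throughput figures can be compared directly. The helpers below are only an arithmetic sketch (the names `ratio` and `percent_drop` are made up here); fed with the figures from the four netperf runs they give a service demand roughly 3.5 times higher with TSO/GSO off and roughly 2.9 times higher with GRO off, with throughput down by about 39% and 54% respectively.

```c
/* Ratio of two measurements, e.g. service demand with an offload off
 * versus on. Values above 1.0 mean the "off" case costs more. */
static double ratio(double off, double on)
{
	return off / on;
}

/* Percentage drop going from `before` to `after`. */
static double percent_drop(double before, double after)
{
	return 100.0 * (before - after) / before;
}
```

For example, `ratio(1.486, 0.428)` is the TSO/GSO service demand ratio and `percent_drop(9154.02, 4212.06)` is the throughput loss with GRO off.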
4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz
And even if one does have the CPU cycles to burn so to speak, the effect
on power consumption needs to be included in the calculus.
happy benchmarking,
rick jones
* Re: [PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.
From: Pravin Shelar @ 2016-12-01 19:50 UTC (permalink / raw)
To: Jiri Benc; +Cc: Jarno Rajahalme, Linux Kernel Network Developers, Eric Garver
In-Reply-To: <20161130145159.3cee7ba4@griffin>
On Wed, Nov 30, 2016 at 5:51 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Tue, 29 Nov 2016 15:30:52 -0800, Jarno Rajahalme wrote:
>> @@ -504,11 +485,20 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
>> goto drop;
>> }
>>
>> - if (unlikely(packet_length(skb, vport->dev) > mtu &&
>> - !skb_is_gso(skb))) {
>> - net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n",
>> - vport->dev->name,
>> - packet_length(skb, vport->dev), mtu);
>> + if (unlikely(!is_skb_forwardable(vport->dev, skb))) {
>
> How does this work when the vlan tag is accelerated? Then we can be
> over MTU, yet the check will pass.
>
This is not changing any behavior compared to current OVS vlan checks.
Single vlan header is not considered for MTU check.
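The exchange above can be illustrated with a simplified userspace model of the length check. This is a sketch under assumptions: the limit of MTU plus link-layer header plus one VLAN header reflects how the kernel helper is understood to behave in kernels of this era, the device-up check is omitted, and every `toy_` name is hypothetical.

```c
#include <stdbool.h>

#define TOY_VLAN_HLEN 4 /* size of one 802.1Q tag in bytes */

struct toy_dev {
	unsigned int mtu;             /* e.g. 1500 */
	unsigned int hard_header_len; /* e.g. 14 for Ethernet */
};

struct toy_skb {
	unsigned int len; /* bytes in the buffer, including the L2 header */
	bool is_gso;      /* segmented later, so the length is not checked */
	bool vlan_accel;  /* tag kept out of band, not counted in len */
};

/* Simplified length check: one VLAN header of slack is always allowed. */
static bool toy_is_forwardable(const struct toy_dev *dev,
			       const struct toy_skb *skb)
{
	unsigned int limit = dev->mtu + dev->hard_header_len + TOY_VLAN_HLEN;

	return skb->len <= limit || skb->is_gso;
}

/* Bytes that would reach the wire once an accelerated tag is inserted. */
static unsigned int toy_wire_len(const struct toy_skb *skb)
{
	return skb->len + (skb->vlan_accel ? TOY_VLAN_HLEN : 0);
}
```

With a 1500-byte MTU, a 1518-byte buffer carrying an accelerated tag passes the check but occupies 1522 bytes on the wire, which is the case Jiri asks about; the reply is that one VLAN header was already outside the MTU check before this patch.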
* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 19:51 UTC (permalink / raw)
To: Florian Westphal; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161201024407.GE26507@breakpoint.cc>
On Wed, Nov 30, 2016 at 6:44 PM, Florian Westphal <fw@strlen.de> wrote:
> Tom Herbert <tom@herbertland.com> wrote:
>> Posting for discussion....
>
> Warning: You are not going to like this reply...
>
>> Now that XDP seems to be nicely gaining traction
>
> Yes, I regret to see that. XDP seems useful to create impressive
> benchmark numbers (and little else).
>
> I will send a separate email to keep that flamebait part away from
> this thread though.
>
> [..]
>
>> addresses the performance gap for stateless packet processing). The
>> problem statement is analogous to that which we had for XDP, namely
>> can we create a mode in the kernel that offers the same performance
>> that is seen with L4 protocols over kernel bypass
>
> Why? If you want to bypass the kernel, then DO IT.
>
I don't want kernel bypass. I want the Linux stack to provide
something close to bare metal performance for TCP/UDP for some latency
sensitive applications we run.
> There is nothing wrong with DPDK. The ONLY problem is that the kernel
> does not offer a userspace fastpath like Windows RIO or FreeBSD's netmap.
>
> But even without that it's not difficult to get DPDK running.
>
That is not true for large scale deployments. Also, TXDP is about
accelerating transport layers like TCP, DPDK is just the interface
from userspace to device queues. We need a whole lot more with DPDK, a
userspace TCP/IP stack for instance, to consider that we have an
equivalent functionality.
> (T)XDP seems born from spite, not technical rationale.
> IMO everyone would be better off if we'd just have something netmap-esque
> in the network core (also see below).
>
>> I imagine there are a few reasons why userspace TCP stacks can get
>> good performance:
>>
>> - Spin polling (we already can do this in kernel)
>> - Lockless, I would assume that threads typically have exclusive
>> access to a queue pair for a connection
>> - Minimal TCP/IP stack code
>> - Zero copy TX/RX
>> - Light weight structures for queuing
>> - No context switches
>> - Fast data path for in order, uncongested flows
>> - Silo'ing between application and device queues
>
> I only see two cases:
>
> 1. Many applications running (standard OS model) that need to
> send/receive data
> -> Linux Network Stack
>
> 2. Single dedicated application that does all rx/tx
>
> -> no queueing needed (can block network rx completely if receiver
> is slow)
> -> no allocations needed at runtime at all
> -> no locking needed (single producer, single consumer)
>
> If you have #2 and you need to be fast etc then full userspace
> bypass is fine. We will -- no matter what we do in kernel -- never
> be able to keep up with the speed you can get with that
> because we have to deal with #1. (Plus the ease of use/freedom of doing
> userspace programming). And yes, I think that #2 is something we
> should address solely by providing netmap or something similar.
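The "no locking, no runtime allocation, single producer, single consumer" properties listed for case 2 are those of a classic lock-free ring. The sketch below uses C11 atomics to show the idea; it is a generic illustration with hypothetical names, not netmap's or any driver's actual ring layout.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 4 /* must be a power of two */

struct spsc_ring {
	_Atomic size_t head;  /* written only by the producer */
	_Atomic size_t tail;  /* written only by the consumer */
	int slots[RING_SIZE]; /* preallocated: no allocation at runtime */
};

/* Producer side. Returns false when the ring is full, so a slow consumer
 * pushes back on the producer instead of growing a queue. */
static bool spsc_push(struct spsc_ring *r, int v)
{
	size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;
	r->slots[head & (RING_SIZE - 1)] = v;
	/* Publish the slot contents before the new head becomes visible. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side. Returns false when the ring is empty. */
static bool spsc_pop(struct spsc_ring *r, int *out)
{
	size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;
	*out = r->slots[tail & (RING_SIZE - 1)];
	/* Release the slot back to the producer. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

Because each index has exactly one writer, no lock is needed; the scheme stops being safe as soon as a second producer or consumer is added, which is the constraint that case 1 (many applications) cannot accept.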
>
> But even considering #1 there are ways to speed stack up:
>
> I'd kill RFS/RPS so we don't have IPI anymore and skb stays
> on same cpu up to where it gets queued (ofo or rx queue).
>
The reference to RPS and RFS is only to move packets off the hot CPU
that are not in the datapath. For instance if we get a FIN for a
connection we can put this into a slow path since FIN processing is
not latency sensitive but may take a considerable amount of CPU to
process.
> Then we could tell driver what happened with the skb it gave us, e.g.
> we can tell driver it can do immediate page/dma reuse, for example
> in pure ack case as opposed to skb sitting in ofo or receive queue.
>
> (RPS/RFS functionality could still be provided via one of the gazillion
> hooks we now have in the stack for those that need/want it), so I do
> not think we would lose functionality.
>
>> - Call into TCP/IP stack with page data directly from driver-- no
>> skbuff allocation or interface. This is essentially provided by the
>> XDP API although we would need to generalize the interface to call
>> stack functions (I previously posted patches for that). We will also
>> need a new action, XDP_HELD?, that indicates the XDP function held the
>> packet (put on a socket for instance).
>
> Seems this will not work at all with the planned page pool thing when
> pages start to be held indefinitely.
>
The processing needed to gift a page to the stack shouldn't be very
different than what a driver needs to do when an skbuff is created
when XDP_PASS is returned. We probably would want to pass additional
meta data, things like checksum and vlan information from received
descriptor to the stack. A callback can be included if the stack
decides it wants to hold on to the buffer and driver needs to do
dma_sync etc. for that.
> You can also never get even close to userspace offload stacks once you
> need/do this; allocations in hotpath are too expensive.
>
> [..]
>
>> - When we transmit, it would be nice to go straight from TCP
>> connection to an XDP device queue and in particular skip the qdisc
>> layer. This follows the principle of low latency being first criteria.
>
> It will never be lower than userspace offloads so anyone with serious
> "low latency" requirement (trading) will use that instead.
>
Maybe, but the question is how close can we get? If we can get within
say 10-20% performance that would be a win.
> Whats your target audience?
>
Many applications, but the most recent one that seems to be driving the
need for very low latency is machine learning. The competition here
really isn't DPDK but is still RDMA (tomorrow's technology for the
past twenty years ;-) ). When the apps guys run their tests, they see
a huge difference between RDMA performance and the stack out of the
box-- like latency for an op goes from 1 usec to 30 usecs. So the apps
guys naturally want RDMA, but anyone in kernel or network ops knows
the nightmare that deploying RDMA entails. If we can get the latencies
and variance down to something comparable (say <5 usecs) then we have
a much stronger argument that we can avoid the immense costs that RDMA
brings in.
>> longer latencies in effect which likely means TXDP isn't appropriate
>> in such cases. BQL is also out, however we would want the TX
>> batching of XDP.
>
> Right, congestion control and buffer bloat are totally overrated .. 8-(
>
> So far I haven't seen anything that would need XDP at all.
>
> What makes it technically impossible to apply these miracles to the
> stack...?
>
> E.g. "mini-skb": Even if we assume that this provides a speedup
> (where does that come from? should make no difference if a 32 or
> 320 byte buffer gets allocated).
>
It's the zero'ing of three cache lines. I believe we talked about that
at netdev.
> If we assume that it's the zeroing of sk_buff (but iirc it made little
> to no difference), could add
>
> unsigned long skb_extensions[1];
>
> to sk_buff, then move everything not needed for tcp fastpath
> (e.g. secpath, conntrack, nf_bridge, tunnel encap, tc, ...)
> below that
>
Yes, that's the intent.
> Then convert accesses to accessors and init it on demand.
>
> One could probably also split cb[] into a smaller fastpath area
> and another one at the end that won't be touched at allocation time.
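One possible reading of the suggestion above is sketched here in userspace C: allocation clears only the fast-path fields plus an `extensions` word, and each slow-path field is initialised the first time its accessor is called. Treating the word as a bitmask of initialised fields is an assumption made for this sketch, and `struct toy_buf` and all `toy_`/`TOY_` names are hypothetical; it is not the actual sk_buff layout.

```c
#include <stddef.h>
#include <string.h>

struct toy_buf {
	/* Fast path: cleared on every allocation. */
	unsigned int len;
	unsigned int hash;
	unsigned long extensions[1]; /* bit n set => slow field n is valid */

	/* Slow path: NOT cleared at allocation time. */
	void *secpath;   /* bit 0 */
	void *conntrack; /* bit 1 */
};

#define TOY_EXT_SECPATH   0
#define TOY_EXT_CONNTRACK 1

/* Allocation-time init: zero only up to the first slow-path field. */
static void toy_buf_init(struct toy_buf *b)
{
	memset(b, 0, offsetof(struct toy_buf, secpath));
}

/* Accessor: initialise the field on first use, then hand it out. */
static void **toy_buf_conntrack(struct toy_buf *b)
{
	if (!(b->extensions[0] & (1UL << TOY_EXT_CONNTRACK))) {
		b->conntrack = NULL;
		b->extensions[0] |= 1UL << TOY_EXT_CONNTRACK;
	}
	return &b->conntrack;
}
```

The cost of this layout is that every user of a slow-path field has to go through an accessor, which is the "convert accesses to accessors" step mentioned above.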
>
>> Miscellaneous
>
>> contemplating that connections/sockets can be bound to particular
>> CPUs and that any operations (socket operations, timers, receive
>> processing) must occur on that CPU. The CPU would be the one where RX
>> happens. Note this implies perfect silo'ing, everything from driver RX
>> to application processing happens inline on the CPU. The stack would
>> not cross CPUs for a connection while in this mode.
>
> Again don't see how this relates to xdp. Could also be done with
> current stack if we make rps/rfs pluggable since nothing else
> currently pushes skb to another cpu (except when scheduler is involved
> via tc mirred, netfilter userspace queueing etc) but that is always
> explicit (i.e. skb leaves softirq protection).
>
> Can we please fix and improve what we already have rather than creating
> yet another NIH thing that will have to be maintained forever?
>
That's what we are doing, and this is a major reason why we need to
improve Linux as opposed to introducing parallel stacks. The cost of
supporting modifications to Linux pales in comparison to what we would
need to support a parallel stack.
Tom
> Thanks.
* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: David Miller @ 2016-12-01 20:04 UTC (permalink / raw)
To: idosch
Cc: hannes, jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis,
ogerlitz, roopa, dsa, nikolay, andy, vivien.didelot, andrew,
f.fainelli, alexander.h.duyck, kaber
In-Reply-To: <20161130163229.rkxvuwukgg35ktrx@splinter.mtl.com>
Hannes and Ido,
It looks like we are very close to having this in mergeable shape, can
you guys work out this final issue and figure out if it really is
a merge stopper or not?
Thanks.
* Re: [PATCH 2/2] net: rfkill: Add rfkill-any LED trigger
From: Michał Kępień @ 2016-12-01 20:08 UTC (permalink / raw)
To: kbuild test robot
Cc: kbuild-all, Johannes Berg, David S . Miller, linux-wireless,
netdev, linux-kernel
In-Reply-To: <201612020131.aDbI7Mq9%fengguang.wu@intel.com>
> Hi Michał,
>
> [auto build test ERROR on mac80211-next/master]
> [also build test ERROR on v4.9-rc7 next-20161201]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Micha-K-pie/net-rfkill-Cleanup-error-handling-in-rfkill_init/20161202-002119
> base: https://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next.git master
> config: i386-randconfig-x004-201648 (attached as .config)
> compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
> reproduce:
> # save the attached .config to linux build tree
> make ARCH=i386
>
> All errors (new ones prefixed by >>):
>
> net/rfkill/core.c: In function 'rfkill_set_block':
> >> net/rfkill/core.c:354:2: error: implicit declaration of function '__rfkill_any_led_trigger_event' [-Werror=implicit-function-declaration]
> __rfkill_any_led_trigger_event();
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> net/rfkill/core.c: In function 'rfkill_init':
> net/rfkill/core.c:1349:1: warning: label 'error_led_trigger' defined but not used [-Wunused-label]
> error_led_trigger:
> ^~~~~~~~~~~~~~~~~
> At top level:
> net/rfkill/core.c:243:13: warning: 'rfkill_any_led_trigger_unregister' defined but not used [-Wunused-function]
> static void rfkill_any_led_trigger_unregister(void)
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> net/rfkill/core.c:238:12: warning: 'rfkill_any_led_trigger_register' defined but not used [-Wunused-function]
> static int rfkill_any_led_trigger_register(void)
> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> cc1: some warnings being treated as errors
>
> vim +/__rfkill_any_led_trigger_event +354 net/rfkill/core.c
>
> 348 rfkill->state &= ~RFKILL_BLOCK_SW_SETCALL;
> 349 rfkill->state &= ~RFKILL_BLOCK_SW_PREV;
> 350 curr = rfkill->state & RFKILL_BLOCK_SW;
> 351 spin_unlock_irqrestore(&rfkill->lock, flags);
> 352
> 353 rfkill_led_trigger_event(rfkill);
> > 354 __rfkill_any_led_trigger_event();
> 355
> 356 if (prev != curr)
> 357 rfkill_event(rfkill);
Thanks, these are obviously all valid concerns. Sorry for being sloppy
with the ifdefs. If I get positive feedback on the proposed feature
itself, all these issues (and the warning pointed out in the other
message) will be resolved in v2.
--
Best regards,
Michał Kępień
* Re: [WIP] net+mlx4: auto doorbell
From: Eric Dumazet @ 2016-12-01 20:11 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Saeed Mahameed, Rick Jones, Linux Netdev List, Saeed Mahameed,
Tariq Toukan
In-Reply-To: <20161201201707.5f51a02e@redhat.com>
On Thu, 2016-12-01 at 20:17 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 01 Dec 2016 09:04:17 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > BTW, if you are doing tests on mlx4 40Gbit,
>
> I'm mostly testing with mlx5 50Gbit, but I do have 40G NIC in the
> machines too.
>
> > would you check the
> > following quick/dirty hack, using lots of low-rate flows ?
>
> What tool should I use to send "low-rate flows"?
>
You could use https://github.com/google/neper
It supports SO_MAX_PACING_RATE, and you could launch 1600 flows, rate
limited to 3028000 bytes per second (so sending one 2-MSS TSO packet
every ms per flow)
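The figures above can be checked with a little arithmetic. The sketch assumes that 3028000 comes from two 1514-byte frames (1500-byte MTU plus a 14-byte Ethernet header) every millisecond; the function names are made up for illustration. `SO_MAX_PACING_RATE` is the Linux socket option named above, set here through plain `setsockopt()` with a rate in bytes per second, and guarded so the code still compiles where the option is not defined.

```c
#include <stdint.h>
#include <sys/socket.h>

#define FRAME_BYTES 1514u /* assumption: 1500-byte MTU + 14-byte Ethernet header */

/* Per-flow pacing rate, in bytes per second, for one burst of
 * `frames_per_burst` full frames every `interval_us` microseconds. */
static uint64_t pacing_rate_bytes_per_sec(unsigned int frames_per_burst,
					  unsigned int interval_us)
{
	return (uint64_t)frames_per_burst * FRAME_BYTES * 1000000u / interval_us;
}

/* Aggregate offered load of `flows` identical flows, in bits per second. */
static uint64_t aggregate_bits_per_sec(unsigned int flows, uint64_t rate)
{
	return (uint64_t)flows * rate * 8u;
}

/* Apply the cap to a socket. Returns 0 on success, -1 on error or when
 * the option is not available on this platform. */
static int set_max_pacing_rate(int fd, unsigned int bytes_per_sec)
{
#ifdef SO_MAX_PACING_RATE
	return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
			  &bytes_per_sec, sizeof(bytes_per_sec));
#else
	(void)fd;
	(void)bytes_per_sec;
	return -1;
#endif
}
```

Under that assumption, `pacing_rate_bytes_per_sec(2, 1000)` gives 3028000, and 1600 such flows offer about 38.8 Gbit/s in total, which sits just under the 40 Gbit line rate discussed in this thread.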
> And what am I looking for?
Max throughput, in packets per second :/
* Re: Initial thoughts on TXDP
From: Sowmini Varadhan @ 2016-12-01 20:13 UTC (permalink / raw)
To: Tom Herbert; +Cc: Linux Kernel Network Developers
In-Reply-To: <CALx6S35DCyi_2z1pqCLaB1bVyNykP_J3YaYEXUT8xxmuzyBDwA@mail.gmail.com>
On (12/01/16 11:05), Tom Herbert wrote:
>
> Polling does not necessarily imply that networking monopolizes the CPU
> except when the CPU is otherwise idle. Presumably the application
> drives the polling when it is ready to receive work.
I'm not grokking that- "if the cpu is idle, we want to busy-poll
and make it 0% idle"? Keeping CPU 0% idle has all sorts
of issues, see slide 20 of
http://www.slideshare.net/shemminger/dpdk-performance
> > and one other critical difference from the hot-potato-forwarding
> > model (the sort of OVS model that DPDK etc might arguably be a fit for)
> > does not apply: in order to figure out the ethernet and IP headers
> > in the response correctly at all times (in the face of things like VRRP,
> > gw changes, gw's mac addr changes etc) the application should really
> > be listening on NETLINK sockets for modifications to the networking
> > state - again points to needing a select() socket set where you can
> > have both the I/O fds and the netlink socket,
> >
> I would think that that management would not be implemented in a
> fast path processing thread for an application.
sure, but my point was that *XDP and other stack-bypass methods need
to provide a select()able socket: when your use-case is not about just
networking, you have to snoop on changes to the control plane, and update
your data path. In the OVS case (pure networking) the OVS control plane
updates are intrinsic to OVS. For the rest of the request/response world,
we need a select()able socket set to do this elegantly (not really
possible in DPDK, for example)
> The *SOs are always an interesting question. They make for great
> benchmarks, but in real life the amount of benefit is somewhat
> unclear. Under the wrong conditions, like all cwnds have collapsed or
I think Rick's already bringing up this one.
--Sowmini
* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 20:18 UTC (permalink / raw)
To: Rick Jones; +Cc: Sowmini Varadhan, Linux Kernel Network Developers
In-Reply-To: <aac93b13-6298-b9eb-7f3c-b074f22c388c@hpe.com>
On Thu, Dec 1, 2016 at 11:48 AM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 12/01/2016 11:05 AM, Tom Herbert wrote:
>>
>> For the GSO and GRO the rationale is that performing the extra SW
>> processing to do the offloads is significantly less expensive than
>> running each packet through the full stack. This is true in a
>> multi-layered generalized stack. In TXDP, however, we should be able
>> to optimize the stack data path such that that would no longer be
>> true. For instance, if we can process the packets received on a
>> connection quickly enough so that it's about the same or just a little
>> more costly than GRO processing then we might bypass GRO entirely.
>> TSO is probably still relevant in TXDP since it reduces overheads
>> processing TX in the device itself.
>
>
> Just how much per-packet path-length are you thinking will go away under the
> likes of TXDP? It is admittedly "just" netperf but losing TSO/GSO does some
> non-trivial things to effective overhead (service demand) and so throughput:
>
For plain in-order TCP packets I believe we should be able to process
each packet at nearly the same speed as GRO. Most of the protocol
processing we do between GRO and the stack is the same; the
difference is that we need to do a connection lookup in the stack
path (note we now do this in UDP GRO and that hasn't shown up as a
major hit). We also need to consider enqueue/dequeue on the socket,
which is a major reason to try for lockless sockets in this instance.
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send   Send                         Utilization      Service Demand
> Socket Socket Message Elapsed              Send    Recv     Send    Recv
> Size   Size   Size    Time     Throughput  local   remote   local   remote
> bytes  bytes  bytes   secs.    10^6bits/s  % S     % U      us/KB   us/KB
>
> 87380  16384  16384   10.00    9260.24     2.02    -1.00    0.428   -1.000
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 tso off gso off
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -- -P 12867
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send   Send                         Utilization      Service Demand
> Socket Socket Message Elapsed              Send    Recv     Send    Recv
> Size   Size   Size    Time     Throughput  local   remote   local   remote
> bytes  bytes  bytes   secs.    10^6bits/s  % S     % U      us/KB   us/KB
>
> 87380  16384  16384   10.00    5621.82     4.25    -1.00    1.486   -1.000
>
> And that is still with the stretch-ACKs induced by GRO at the receiver.
>
Sure, but try running something that emulates a more realistic workload
than a TCP stream, like an RR test with relatively small payloads and many
connections.
> Losing GRO has quite similar results:
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send   Send                         Utilization      Service Demand
> Socket Socket Message Elapsed              Recv    Send     Recv    Send
> Size   Size   Size    Time     Throughput  local   remote   local   remote
> bytes  bytes  bytes   secs.    10^6bits/s  % S     % U      us/KB   us/KB
>
> 87380  16384  16384   10.00    9154.02     4.00    -1.00    0.860   -1.000
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ sudo ethtool -K hed0 gro off
>
> stack@np-cp1-c0-m1-mgmt:~/rjones2$ ./netperf -c -H np-cp1-c1-m3-mgmt -t TCP_MAERTS -- -P 12867
> MIGRATED TCP MAERTS TEST from 0.0.0.0 (0.0.0.0) port 12867 AF_INET to np-cp1-c1-m3-mgmt () port 12867 AF_INET : demo
> Recv   Send   Send                         Utilization      Service Demand
> Socket Socket Message Elapsed              Recv    Send     Recv    Send
> Size   Size   Size    Time     Throughput  local   remote   local   remote
> bytes  bytes  bytes   secs.    10^6bits/s  % S     % U      us/KB   us/KB
>
> 87380  16384  16384   10.00    4212.06     5.36    -1.00    2.502   -1.000
>
> I'm sure there is a very non-trivial "it depends" component here - netperf
> will get the peak benefit from *SO and so one will see the peak difference
> in service demands - but even if one gets only 6 segments per *SO that is a
> lot of path-length to make-up.
>
True, but I think there's a lot of path we'll be able to cut out. In
this mode we don't need IPtables, Netfilter, input route, IPvlan
check, or other similar lookups. Once we've successfully matched an
established TCP state, anything related to policy on both TX and RX for
that connection is inferred from the state. We want the processing
path in this case to be concerned with just protocol processing
and the interface to the user.
> 4.4 kernel, BE3 NICs ... E5-2640 0 @ 2.50GHz
>
> And even if one does have the CPU cycles to burn so to speak, the effect on
> power consumption needs to be included in the calculus.
>
Definitely, power consumption is the down side of spin polling CPUs.
As I said, we should never be spinning any more CPUs than
necessary to handle the load.
Tom
> happy benchmarking,
>
> rick jones
* Re: [WIP] net+mlx4: auto doorbell
From: David Miller @ 2016-12-01 20:20 UTC (permalink / raw)
To: eric.dumazet; +Cc: brouer, saeedm, rick.jones2, netdev, saeedm, tariqt
In-Reply-To: <1480611857.18162.319.camel@edumazet-glaptop3.roam.corp.google.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 01 Dec 2016 09:04:17 -0800
> On Thu, 2016-12-01 at 17:04 +0100, Jesper Dangaard Brouer wrote:
>
>> When qdisc layer or trafgen/af_packet see this indication it knows it
>> should/must flush the queue when it doesn't have more work left. Perhaps
>> through net_tx_action(), by registering itself and e.g. if qdisc_run()
>> is called and queue is empty then check if queue needs a flush. I would
>> also allow driver to flush and clear this bit.
>
> net_tx_action() is not normally called, unless BQL limit is hit and/or
> some qdiscs with throttling (HTB, TBF, FQ, ...)
The one thing I wonder about is whether we should "ramp up" into a mode
where the NAPI poll does the doorbells instead of going directly there.
Maybe I misunderstand your algorithm, but it looks to me like if there
are any active packets in the TX queue at enqueue time you will defer
the doorbell to the interrupt handler.
Let's say we put 1 packet in, and hit the doorbell.
Then another packet comes in and we defer the doorbell to the IRQ.
At this point there are a couple things I'm unclear about.
For example, if we didn't hit the doorbell, will the chip still take a
peek at the second descriptor? Depending upon how the doorbell works
it might, or it might not.
Either way, wouldn't there be a possible condition where the chip
wouldn't see the second enqueued packet and we'd thus have the wire
idle until the interrupt + NAPI runs and hits the doorbell?
This is why I think we should "ramp up" the doorbell deferral, in
order to avoid this potential wire idle time situation.
Maybe the situation I'm worried about is not possible, so please
explain it to me :-)
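The scenario walked through above can be written down as a toy model. Everything here is an assumption made for illustration: the names are hypothetical, "active packets in the TX queue" is modelled as `prod != cons`, and the NIC is assumed not to look past the last doorbell value, which is precisely the open question in this mail. It is not the code from the WIP patch.

```c
#include <stdbool.h>

/* Toy TX ring. All indices only ever increase. */
struct toy_txq {
	unsigned int prod;     /* descriptors written by the driver */
	unsigned int doorbell; /* last producer index the NIC was told about */
	unsigned int cons;     /* descriptors the NIC has completed */
};

/* Enqueue one packet. The doorbell is rung only if the queue was empty;
 * otherwise it is left to the completion handler. Returns true if rung. */
static bool toy_xmit(struct toy_txq *q)
{
	bool was_empty = (q->prod == q->cons);

	q->prod++;
	if (was_empty) {
		q->doorbell = q->prod;
		return true;
	}
	return false;
}

/* The NIC sends everything it has been told about, and nothing more. */
static void toy_nic_transmit(struct toy_txq *q)
{
	q->cons = q->doorbell;
}

/* Descriptors queued by the driver but not yet visible to the NIC. */
static unsigned int toy_hidden(const struct toy_txq *q)
{
	return q->prod - q->doorbell;
}

/* Completion interrupt / NAPI poll: ring any deferred doorbell. */
static bool toy_tx_irq(struct toy_txq *q)
{
	if (q->doorbell != q->prod) {
		q->doorbell = q->prod;
		return true;
	}
	return false;
}
```

Calling `toy_xmit` twice, then `toy_nic_transmit`, leaves `toy_hidden` at 1: the first packet is on the wire, the second is queued but invisible until `toy_tx_irq` runs. In this model that gap is the potential wire idle time.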
* Re: [RFC net-next 0/3] net: bridge: Allow CPU port configuration
From: Florian Fainelli @ 2016-12-01 20:21 UTC (permalink / raw)
To: Ido Schimmel; +Cc: idosch, andrew, vivien.didelot, netdev, bridge, jiri, davem
In-Reply-To: <20161123134856.cwk6sznnwa7p4xtq@splinter.mtl.com>
On 11/23/2016 05:48 AM, Ido Schimmel wrote:
> Hi Florian,
>
> On Tue, Nov 22, 2016 at 09:56:30AM -0800, Florian Fainelli wrote:
>> On 11/22/2016 09:41 AM, Ido Schimmel wrote:
>>> Hi Florian,
>>>
>>> On Mon, Nov 21, 2016 at 11:09:22AM -0800, Florian Fainelli wrote:
>>>> Hi all,
>>>>
>>>> This patch series allows using the bridge master interface to configure
>>>> an Ethernet switch port's CPU/management port with different VLAN attributes than
>>>> those of the bridge downstream ports/members.
>>>>
>>>> Jiri, Ido, Andrew, Vivien, please review the impact on mlxsw and mv88e6xxx, I
>>>> tested this with b53 and a mockup DSA driver.
>>>
>>> We'll need to add a check in mlxsw and ignore any VLAN configuration for
>>> the bridge device itself. Otherwise, any configuration done on br0 will
>>> be propagated to all of its slaves, which is incorrect.
>>>
>>>>
>>>> Open questions:
>>>>
>>>> - if we have more than one bridge on top of a physical switch, the driver
>>>> should keep track of that and verify that we are not going to change
>>>> the CPU port VLAN attributes in a way that results in incompatible settings
>>>> to be applied
>>>>
>>>> - if the default behavior is to have all VLANs associated with the CPU port
>>>> be ingressing/egressing tagged to the CPU, is this really useful?
>>>
>>> First of all, I want to be sure that when we say "CPU port", we're
>>> talking about the same thing. In mlxsw, the CPU port is a pipe between
>>> the device and the host, through which all packets trapped to the host
>>> go through. So, when a packet is trapped, the driver reads its Rx
>>> descriptor, checks through which port it ingressed, resolves its netdev,
>>> sets skb->dev accordingly and injects it to the Rx path via
>>> netif_receive_skb(). The CPU port itself isn't represented using a
>>> netdev.
>>
>> In the case of DSA, the CPU port is a normal Ethernet MAC driver, but in
>> principle, this driver plus the DSA tag protocol hook do exactly the same
>> things as you just describe.
>
> Thanks for the detailed explanation! I also took the time to read
> dsa.txt, however I still don't quite understand the motivation for
> VLAN filtering on the CPU port. In which cases would you like to prevent
> packets from going to the host due to their VLAN header? This change
> would make sense to me if you only had a limited number of VLANs you
> could enable on the CPU port, but from your response I understand that
> this isn't the case.
It's not much about VLAN filtering per-se, but more about the default
VLAN membership of the CPU port, in the absence of any explicit
configuration. As a user, I find it a little inconvenient to have to
create one VLAN interface per VLAN that I am adding to the bridge to be
able to terminate that traffic properly towards the host/CPU/management
interface, especially when this VLAN is untagged.
This is really the motivation for these patches: if there is only one
VLAN configured, and it's the default VLAN for all ports, then the
bridge master interface also terminates this VLAN with the same
properties as those added by default (typically with default_pvid: VID 1
untagged, unless changed of course).
If you don't want that as an user, you now have the ability to change
it, and make this VLAN (or any other for that matter) to be terminated
differently at the host/CPU/management port level than how it is
egressing at the downstream ports part of that VLAN too.
Maybe it's a bit overkill...
>
> FWIW, I don't have a problem with patches if they are useful for you,
> I'm just trying to understand the use case. We can easily patch mlxsw to
> ignore calls with orig_dev=br0.
OK, if I resubmit, I will try to take care of mlxsw and rocker as well.
Thanks!
--
Florian
^ permalink raw reply
* [PATCH iproute2 1/1] tc: updated man page to reflect handle-id use in filter GET command.
From: Roman Mashak @ 2016-12-01 20:20 UTC (permalink / raw)
To: stephen; +Cc: netdev, sathya.perla, Roman Mashak
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
---
man/man8/tc.8 | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 8a47a2b..d957ffa 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -32,7 +32,9 @@ class-id ] qdisc
DEV
.B [ parent
qdisc-id
-.B | root ] protocol
+.B | root ] [ handle
+handle-id ]
+.B protocol
protocol
.B prio
priority filtertype
@@ -577,7 +579,7 @@ it is created.
.TP
get
-Displays a single filter given the interface, parent ID, priority, protocol and handle ID.
+Displays a single filter given the interface, qdisc-id, priority, protocol and handle-id.
.TP
show
--
1.9.1
^ permalink raw reply related
* Re: [PATCH] stmmac: simplify flag assignment
From: David Miller @ 2016-12-01 20:23 UTC (permalink / raw)
To: pavel; +Cc: peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <20161130114431.GB14296@amd>
From: Pavel Machek <pavel@ucw.cz>
Date: Wed, 30 Nov 2016 12:44:31 +0100
>
> Simplify flag assignment.
>
> Signed-off-by: Pavel Machek <pavel@denx.de>
>
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index ed20668..0b706a7 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -2771,12 +2771,8 @@ static netdev_features_t stmmac_fix_features(struct net_device *dev,
> features &= ~NETIF_F_CSUM_MASK;
>
> /* Disable tso if asked by ethtool */
> - if ((priv->plat->tso_en) && (priv->dma_cap.tsoen)) {
> - if (features & NETIF_F_TSO)
> - priv->tso = true;
> - else
> - priv->tso = false;
> - }
> + if ((priv->plat->tso_en) && (priv->dma_cap.tsoen))
> + priv->tso = !!(features & NETIF_F_TSO);
>
Pavel, this really seems arbitrary.
Whilst I really appreciate you're looking into this driver a bit because
of some issues you are trying to resolve, I'd like to ask that you not
start bombarding me with nit-pick cleanups here and there and instead
concentrate on the real bug or issue.
Thanks in advance.
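
For readers comparing the two forms in the quoted hunk, here is a minimal user-space sketch of why they behave the same. It is not driver code: DEMO_F_TSO is a made-up stand-in for NETIF_F_TSO, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's NETIF_F_TSO feature bit. */
#define DEMO_F_TSO (UINT64_C(1) << 16)

/* Original form: explicit branch on the masked feature bit. */
bool tso_branch(uint64_t features)
{
	if (features & DEMO_F_TSO)
		return true;
	else
		return false;
}

/* Proposed form: double negation collapses any non-zero value to 1. */
bool tso_compact(uint64_t features)
{
	return !!(features & DEMO_F_TSO);
}

/* Self-check: both forms agree when the bit is clear, set, or mixed
 * with unrelated bits. */
void check_equivalence(void)
{
	assert(tso_branch(0) == tso_compact(0));
	assert(tso_branch(DEMO_F_TSO) == tso_compact(DEMO_F_TSO));
	assert(tso_branch(DEMO_F_TSO | 1) == tso_compact(DEMO_F_TSO | 1));
	assert(tso_branch(1) == tso_compact(1));
}
```

Since the result is the same either way, the change is purely cosmetic, which is consistent with the objection above.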
^ permalink raw reply
* Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: David Miller @ 2016-12-01 20:26 UTC (permalink / raw)
To: hannes; +Cc: david.lebrun, netdev
In-Reply-To: <1480511568.3649771.803688521.5B47BE8F@webmail.messagingengine.com>
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 30 Nov 2016 14:12:48 +0100
> David, one question: do you remember if you measured with linked lists
> at that time or also with arrays? I actually would expect small arrays
> that entirely fit into cachelines to be faster than our current
> approach, which also walks a linked list, probably the best algorithm to
> thrash cache lines. I ask because I currently prefer this approach more
> than having large allocations in the O(1) case because of easier code
> and easier management.
I did not try this and I do agree with you that for extremely small table
sizes a list or array would perform better because of the cache behavior.
^ permalink raw reply
* [PATCH -next] net: ethernet: ti: davinci_cpdma: add missing EXPORTs
From: Paul Gortmaker @ 2016-12-01 20:25 UTC (permalink / raw)
To: David S. Miller
Cc: Paul Gortmaker, Ivan Khoronzhuk, Mugunthan V N, Grygorii Strashko,
linux-omap, netdev
As of commit 8f32b90981dcdb355516fb95953133f8d4e6b11d
("net: ethernet: ti: davinci_cpdma: add set rate for a channel") the
ARM allmodconfig builds would fail modpost with:
ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
Since these weren't declared as static, it is assumed they were
meant to be shared outside the file, and that modular build testing
was simply overlooked.
Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
Cc: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Cc: Mugunthan V N <mugunthanvnm@ti.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>
Cc: linux-omap@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
drivers/net/ethernet/ti/davinci_cpdma.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/net/ethernet/ti/davinci_cpdma.c b/drivers/net/ethernet/ti/davinci_cpdma.c
index c776e4575d2d..36518fc5c7cc 100644
--- a/drivers/net/ethernet/ti/davinci_cpdma.c
+++ b/drivers/net/ethernet/ti/davinci_cpdma.c
@@ -796,6 +796,7 @@ int cpdma_chan_set_weight(struct cpdma_chan *ch, int weight)
spin_unlock_irqrestore(&ctlr->lock, flags);
return ret;
}
+EXPORT_SYMBOL_GPL(cpdma_chan_set_weight);
/* cpdma_chan_get_min_rate - get minimum allowed rate for channel
* Should be called before cpdma_chan_set_rate.
@@ -810,6 +811,7 @@ u32 cpdma_chan_get_min_rate(struct cpdma_ctlr *ctlr)
return DIV_ROUND_UP(divident, divisor);
}
+EXPORT_SYMBOL_GPL(cpdma_chan_get_min_rate);
/* cpdma_chan_set_rate - limits bandwidth for transmit channel.
* The bandwidth * limited channels have to be in order beginning from lowest.
@@ -853,6 +855,7 @@ int cpdma_chan_set_rate(struct cpdma_chan *ch, u32 rate)
spin_unlock_irqrestore(&ctlr->lock, flags);
return ret;
}
+EXPORT_SYMBOL_GPL(cpdma_chan_set_rate);
u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
{
@@ -865,6 +868,7 @@ u32 cpdma_chan_get_rate(struct cpdma_chan *ch)
return rate;
}
+EXPORT_SYMBOL_GPL(cpdma_chan_get_rate);
struct cpdma_chan *cpdma_chan_create(struct cpdma_ctlr *ctlr, int chan_num,
cpdma_handler_fn handler, int rx_type)
--
2.11.0
^ permalink raw reply related
* Re: [PATCH net] tcp: warn on bogus MSS and try to amend it
From: David Miller @ 2016-12-01 20:29 UTC (permalink / raw)
To: marcelo.leitner
Cc: netdev, jmaxwell37, alexandre.sidorenko, kuznet, jmorris,
yoshfuji, kaber, tlfalcon, brking, eric.dumazet
In-Reply-To: <0d41deb00d57206f518e6bffae1b0be355bbc726.1480511277.git.marcelo.leitner@gmail.com>
From: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Date: Wed, 30 Nov 2016 11:14:32 -0200
> There have been some reports lately about TCP connection stalls caused
> by NIC drivers that aren't setting gso_size on aggregated packets on rx
> path. This causes TCP to assume that the MSS is actually the size of the
> aggregated packet, which is invalid.
>
> Although the proper fix is to be done at each driver, it's often hard
> and cumbersome for one to debug, come to such root cause and report/fix
> it.
>
> This patch amends this situation in two ways. First, it adds a warning
> when this situation occurs, so it gives a hint to those trying to
> debug this. It also limits the maximum probed MSS to the advertised MSS,
> as it should never be any higher than that.
>
> The result is that the connection may not have the best performance ever
> but it shouldn't stall, and the admin will have a hint on what to look
> for.
>
> Tested with virtio by forcing gso_size to 0.
>
> Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
I totally agree with this change, however I think the warning message can
be improved in two ways:
> len = skb_shinfo(skb)->gso_size ? : skb->len;
> if (len >= icsk->icsk_ack.rcv_mss) {
> - icsk->icsk_ack.rcv_mss = len;
> + icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> + tcp_sk(sk)->advmss);
> + if (icsk->icsk_ack.rcv_mss != len)
> + pr_warn_once("Seems your NIC driver is doing bad RX acceleration. TCP performance may be compromised.\n");
We know it's a bad GRO implementation that causes this so let's be specific in the
message, perhaps something like:
Driver has suspect GRO implementation, TCP performance may be compromised.
Also, we have skb->dev available here most likely, so prefixing the message with
skb->dev->name would make analyzing this situation even easier for someone hitting
this.
I'm not certain if an skb->dev==NULL check is necessary here or not, but it is
definitely something you need to consider.
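
The two suggestions above, together with the clamp from the quoted hunk, can be sketched in user space as follows. This is only an illustration under stated assumptions: struct demo_dev stands in for the single struct net_device field used, the "(unknown)" fallback for a missing device is an invented placeholder, and snprintf() takes the place of pr_warn_once().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the one struct net_device field used here. */
struct demo_dev {
	const char *name;
};

/* Clamp the probed segment length to the advertised MSS, as the quoted
 * patch does with min_t(unsigned int, len, advmss). */
unsigned int clamp_rcv_mss(unsigned int len, unsigned int advmss)
{
	return len < advmss ? len : advmss;
}

/* Format the suggested warning, prefixed with the device name when one is
 * available. The "(unknown)" fallback is an assumption of this sketch.
 * Returns the number of characters snprintf() would have written. */
int format_gro_warning(char *buf, size_t size, const struct demo_dev *dev)
{
	assert(buf != NULL);

	return snprintf(buf, size,
			"%s: Driver has suspect GRO implementation, TCP performance may be compromised.\n",
			(dev && dev->name) ? dev->name : "(unknown)");
}
```

Checking the pointer before dereferencing it is the defensive choice whenever it is unclear whether skb->dev can be NULL at that point.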
Thanks!
^ permalink raw reply
* Re: [PATCH net-next 5/6] net: dsa: mv88e6xxx: add helper for switch ready
From: Vivien Didelot @ 2016-12-01 20:31 UTC (permalink / raw)
To: Andrew Lunn
Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130233810.GT21645@lunn.ch>
Hi Andrew,
Andrew Lunn <andrew@lunn.ch> writes:
> As we have seen in the past, this sort of loop is broken if we end up
> sleeping for a long time. Please take the opportunity to replace it
> with one of our _wait() helpers, e.g. mv88e6xxx_g1_wait()
That won't work. The _wait() helpers are made to wait on self-clear (SC)
bits, i.e. looping until they are cleared to zero.
Here we want the opposite.
I will keep this existing wait loop for the moment and work soon on a
new patchset to rework the wait routines. We need a generic accessor to
test a given value against a given mask and wrappers for busy bits, etc.
>> +int mv88e6xxx_g1_init_ready(struct mv88e6xxx_chip *chip, bool *ready)
>> +{
>> + u16 val;
>> + int err;
>> +
>> + /* Check the value of the InitReady bit 11 */
>> + err = mv88e6xxx_g1_read(chip, GLOBAL_STATUS, &val);
>> + if (err)
>> + return err;
>> +
>> + *ready = !!(val & GLOBAL_STATUS_INIT_READY);
>
> I would actually do the wait here.
That is better indeed.
Thanks,
Vivien
^ permalink raw reply
* Re: [PATCH v3 net-next 3/3] openvswitch: Fix skb->protocol for vlan frames.
From: Pravin Shelar @ 2016-12-01 20:31 UTC (permalink / raw)
To: Jiri Benc; +Cc: Jarno Rajahalme, Linux Kernel Network Developers, Eric Garver
In-Reply-To: <20161130153041.7a9590ef@griffin>
On Wed, Nov 30, 2016 at 6:30 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Tue, 29 Nov 2016 15:30:53 -0800, Jarno Rajahalme wrote:
>> Do not always set skb->protocol to be the ethertype of the L3 header.
>> For a packet with non-accelerated VLAN tags skb->protocol needs to be
>> the ethertype of the outermost non-accelerated VLAN ethertype.
>
> Well, the current handling of skb->protocol matches what used to be the
> handling of the kernel net stack before Jiri Pirko cleaned up the vlan
> code.
>
> I'm not opposed to changing this but I'm afraid it needs much deeper
> review. Because with this in place, no core kernel functions that
> depend on skb->protocol may be called from within openvswitch.
>
Can you give specific example where it does not work?
>> @@ -361,6 +362,11 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
>> if (res <= 0)
>> return res;
>>
>> + /* If the outer vlan tag was accelerated, skb->protocol should
>> + * refelect the inner vlan type. */
>> + if (!eth_type_vlan(skb->protocol))
>> + skb->protocol = key->eth.cvlan.tpid;
>
> This should not depend on the current value in skb->protocol which
> could be arbitrary at this point (from the point of view of how this
> patch understands the skb->protocol values). It's easy to fix, though -
> just add a local bool variable tracking whether the skb->protocol has
> been set.
>
The skb->protocol value is set by the caller, so it should not be
arbitrary. Is it missing in any case?
^ permalink raw reply
* pull-request: can-next 2016-12-01
From: Marc Kleine-Budde @ 2016-12-01 20:21 UTC (permalink / raw)
To: netdev; +Cc: David Miller, kernel@pengutronix.de, linux-can@vger.kernel.org
[-- Attachment #1.1: Type: text/plain, Size: 1907 bytes --]
Hello David,
this is a pull request of 4 patches for net-next/master.
There are two patches by Chris Paterson for the rcar_can and rcar_canfd
device tree binding documentation. And a patch by Geert Uytterhoeven
that corrects the order of interrupt specifiers.
The fourth patch by Colin Ian King fixes a spelling error in the
kvaser_usb driver.
regards,
Marc
---
The following changes since commit 8f679ed88f8860206edddff725e2749b4cdbb0e8:
driver: ipvlan: Remove useless member mtu_adj of struct ipvl_dev (2016-11-30 15:01:32 -0500)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git tags/linux-can-next-for-4.10-20161201
for you to fetch changes up to 0d8f8efd32bace9f222fcc92d4a3132d877e5df6:
net: can: usb: kvaser_usb: fix spelling mistake of "outstanding" (2016-12-01 14:27:02 +0100)
----------------------------------------------------------------
linux-can-next-for-4.10-20161201
----------------------------------------------------------------
Chris Paterson (2):
can: rcar_can: Add r8a7796 support
can: rcar_canfd: Add r8a7796 support
Colin Ian King (1):
net: can: usb: kvaser_usb: fix spelling mistake of "outstanding"
Geert Uytterhoeven (1):
can: rcar_canfd: Correct order of interrupt specifiers
Documentation/devicetree/bindings/net/can/rcar_can.txt | 12 +++++++-----
Documentation/devicetree/bindings/net/can/rcar_canfd.txt | 14 ++++++++------
drivers/net/can/usb/kvaser_usb.c | 4 ++--
3 files changed, 17 insertions(+), 13 deletions(-)
--
Pengutronix e.K. | Marc Kleine-Budde |
Industrial Linux Solutions | Phone: +49-231-2826-924 |
Vertretung West/Dortmund | Fax: +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686 | http://www.pengutronix.de |
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: Initial thoughts on TXDP
From: Tom Herbert @ 2016-12-01 20:39 UTC (permalink / raw)
To: Sowmini Varadhan; +Cc: Linux Kernel Network Developers
In-Reply-To: <20161201201324.GJ24547@oracle.com>
On Thu, Dec 1, 2016 at 12:13 PM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (12/01/16 11:05), Tom Herbert wrote:
>>
>> Polling does not necessarily imply that networking monopolizes the CPU
>> except when the CPU is otherwise idle. Presumably the application
>> drives the polling when it is ready to receive work.
>
> I'm not grokking that- "if the cpu is idle, we want to busy-poll
> and make it 0% idle"? Keeping CPU 0% idle has all sorts
> of issues, see slide 20 of
> http://www.slideshare.net/shemminger/dpdk-performance
>
>> > and one other critical difference from the hot-potato-forwarding
>> > model (the sort of OVS model that DPDK etc might arguably be a fit for)
>> > does not apply: in order to figure out the ethernet and IP headers
>> > in the response correctly at all times (in the face of things like VRRP,
>> > gw changes, gw's mac addr changes etc) the application should really
>> > be listening on NETLINK sockets for modifications to the networking
>> > state - again points to needing a select() socket set where you can
>> > have both the I/O fds and the netlink socket,
>> >
>> I would think that management would not be implemented in a
>> fast path processing thread for an application.
>
> sure, but my point was that *XDP and other stack-bypass methods need
> to provide a select()able socket: when your use-case is not about just
> networking, you have to snoop on changes to the control plane, and update
> your data path. In the OVS case (pure networking) the OVS control plane
> updates are intrinsic to OVS. For the rest of the request/response world,
> we need a select()able socket set to do this elegantly (not really
> possible in DPDK, for example)
>
I'm not sure that TXDP can be reconciled to help OVS. The point of
TXDP is to drive applications closer to bare metal performance, as I
mentioned this is only going to be worth it if the fast path can be
kept simple and not complicated by a requirement for generalization.
It seems like the second we put OVS in we're doubling the data path
and accepting the performance consequences of a complex path anyway.
TXDP can't take over the whole system (any more than DPDK can) and needs to
work in concert with other mechanisms; the key is how to steer the
work amongst the CPUs. For instance, if a latency critical thread is
running on some CPU we either need a dedicated queue for the connections of
the thread (e.g. ntuple filtering or aRFS support) or we need a fast
way to move unrelated packets received on a queue processed by
that CPU to other CPUs (less efficient, but no special HW support is
needed either).
Tom
>
>> The *SOs are always an interesting question. They make for great
>> benchmarks, but in real life the amount of benefit is somewhat
>> unclear. Under the wrong conditions, like all cwnds have collapsed or
>
> I think Rick's already bringing up this one.
>
> --Sowmini
>
^ permalink raw reply
* Re: iproute2 public git outdated?
From: Rami Rosen @ 2016-12-01 20:39 UTC (permalink / raw)
To: Phil Sutter, Netdev, Stephen Hemminger
In-Reply-To: <20161201121806.GA21576@orbyte.nwl.cc>
Hi Phil,
I suggest that you try again now. It seems that the iproute2 git
repo was updated in the last 2-4 hours, and "git log" in master now
shows a patch from 30 November (actually it is your "Add notes about
dropped IPv4 route cache" patch).
Regards,
Rami Rosen
On 1 December 2016 at 14:18, Phil Sutter <phil@nwl.cc> wrote:
> Hi,
>
> I am using iproute2's public git repo at this URL:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git
>
> To my surprise, neither master nor net-next branches have received new
> commits since end of October. Did the repo location change or was it
> just not updated for a while?
>
> Thanks, Phil
^ permalink raw reply
* Re: [PATCH net-next 0/3] sfc: defalconisation fixups
From: David Miller @ 2016-12-01 20:39 UTC (permalink / raw)
To: ecree; +Cc: linux-net-drivers, bkenward, netdev
In-Reply-To: <c52a0276-e379-7841-8d10-d5a834b81c4e@solarflare.com>
From: Edward Cree <ecree@solarflare.com>
Date: Thu, 1 Dec 2016 16:59:13 +0000
> A bug fix, the Kconfig change, and cleaning up a bit more unused code.
>
> Edward Cree (3):
> sfc: fix debug message format string in efx_farch_handle_rx_not_ok
> sfc: don't select SFC_FALCON
> sfc: remove RESET_TYPE_RX_RECOVERY
Series applied, thank you.
^ permalink raw reply
* Re: [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Hannes Frederic Sowa @ 2016-12-01 20:40 UTC (permalink / raw)
To: David Miller, idosch
Cc: jiri, netdev, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz,
roopa, dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
alexander.h.duyck, kaber
In-Reply-To: <20161201.150445.558407356269727869.davem@davemloft.net>
On 01.12.2016 21:04, David Miller wrote:
>
> Hannes and Ido,
>
> It looks like we are very close to having this in mergeable shape, can
> you guys work out this final issue and figure out if it really is
> a merge stopper or not?
Sure, if the fib notification register could be done under protection of
the sequence counter I don't see any more problems.
The sync handler is nice to have and can be done in a later patch series.
^ permalink raw reply
* Re: [PATCH net-next 3/6] net: dsa: mv88e6xxx: add a software reset op
From: Vivien Didelot @ 2016-12-01 20:41 UTC (permalink / raw)
To: Andrew Lunn
Cc: netdev, linux-kernel, kernel, David S. Miller, Florian Fainelli
In-Reply-To: <20161130232633.GS21645@lunn.ch>
Hi Andrew,
Andrew Lunn <andrew@lunn.ch> writes:
>> diff --git a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
>> index ab52c37..9e51405 100644
>> --- a/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
>> +++ b/drivers/net/dsa/mv88e6xxx/mv88e6xxx.h
>> @@ -765,6 +765,9 @@ struct mv88e6xxx_ops {
>> int (*phy_write)(struct mv88e6xxx_chip *chip, int addr, int reg,
>> u16 val);
>>
>> + /* Switch Software Reset */
>> + int (*reset)(struct mv88e6xxx_chip *chip);
>> +
>
> Hi Vivien
>
> In my huge patch series of 6390, i've been using a g1_ prefix for
> functionality which is in global 1, g2_ for global 2, etc. This has
> worked for everything so far with the exception of setting which
> reserved MAC addresses should be sent to the CPU. Most devices have it
> in g2, but 6390 has it in g1.
>
> Please could you add the prefix.
I don't understand. It looks like you are talking about the second part
of the comment I made on your RFC patchset, about the Rsvd2CPU feature:
https://www.mail-archive.com/netdev@vger.kernel.org/msg139837.html
Switch reset routines are implemented in this patch in global1.c as
mv88e6185_g1_reset and mv88e6352_g1_reset.
6185 and 6352 are implementation references for other switches.
Thanks,
Vivien
^ permalink raw reply
* Re: [net PATCH 0/2] Don't use lco_csum to compute IPv4 checksum
From: David Miller @ 2016-12-01 20:41 UTC (permalink / raw)
To: jeffrey.t.kirsher; +Cc: alexander.h.duyck, netdev, intel-wired-lan, sfr
In-Reply-To: <1480540522.2377.18.camel@intel.com>
From: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Date: Wed, 30 Nov 2016 13:15:22 -0800
> On Wed, 2016-11-30 at 09:47 -0500, David Miller wrote:
>> From: Alexander Duyck <alexander.h.duyck@intel.com>
>> Date: Mon, 28 Nov 2016 10:42:18 -0500
>>
>> > When I implemented the GSO partial support in the Intel drivers I was
>> using
>> > lco_csum to compute the checksum that we needed to plug into the IPv4
>> > checksum field in order to cancel out the data that was not a part of
>> the
>> > IPv4 header. However this didn't take into account that the transport
>> > offset might be pointing to the inner transport header.
>> >
>> > Instead of using lco_csum I have just coded around it so that we can
>> use
>> > the outer IP header plus the IP header length to determine where we
>> need to
>> > start our checksum and then just call csum_partial ourselves.
>> >
>> > This should fix the SIT issue reported on igb interfaces as well as
>> similar
>> > issues that would pop up on other Intel NICs.
>>
>> Jeff, are you going to send me a pull request with this stuff or would
>> you be OK with my applying these directly to 'net'?
>
> Go ahead and apply those to your net tree, I do not want to hold this up.
Ok, done, thanks Jeff.
^ permalink raw reply