Netdev List

Netdev List
 help / color / mirror / Atom feed

* ID of (former "National") TI's PHY DP83848
From: Juergen Borleis @ 2018-04-27 11:51 UTC (permalink / raw)
  To: netdev; +Cc: Andrew Lunn, Florian Fainelli

Hi,

I have worked on a DP83848 variant without an interrupt line. While at it, I 
read through all datasheets of existing variants of the DP83848 phy.

Here is what I've found so far:

+--------------+--------------+--------+-----+
| Variant      |    Phy ID    |  Note  | IRQ |
+--------------+--------------+--------+-----+
| DP83848H     |  0x20005c90  |  MINI  | NO  |
+--------------+--------------+--------+-----+
| DP83848J     |  0x20005c90  |  MINI  | NO  |
+--------------+--------------+--------+-----+
| DP83848K     |  0x20005c90  |  MINI  | NO  |
+--------------+--------------+--------+-----+
| DP83848M     |  0x20005c90  |  MINI  | NO  |
+--------------+--------------+--------+-----+
| DP83848T     |  0x20005c90  |  MINI  | NO  |
+--------------+--------------+--------+-----+
| DP83848EP    |  0x20005c90  |        | YES |
+--------------+--------------+--------+-----+
| DP83848HT    |  0x20005c90  |        | YES |
+--------------+--------------+--------+-----+
| DP83848Q     |  0x20005ca2  |  MINI  | NO  |
+--------------+--------------+--------+-----+
| DP83848V     |  0x20005ca2  |        | YES |
+--------------+--------------+--------+-----+
| DP83848C     |  0x20005ca2  |        | YES |
+--------------+--------------+--------+-----+

"MINI" means a less pin count variant.
"IRQ"-"YES" means, the device has a interrupt line,
"IRQ"-"NO" means, the device has no interrupt line.

How to deal with interrupts, if the device itself has no interrupt line (and 
uses the same phy ID)? How to deal with the same Phy ID used for different 
devices and different features?

Cheers,
Juergen

PS: please keep me on CC as I'm not subscribed to the netdev list.

-- 
Pengutronix e.K.                             | Juergen Borleis             |
Industrial Linux Solutions                   | http://www.pengutronix.de/  |

^ permalink raw reply

* Re: [tip:x86/cleanups] x86/bpf: Clean up non-standard comments, to make the code more readable
From: Daniel Borkmann @ 2018-04-27 12:13 UTC (permalink / raw)
  To: peterz, edumazet, davem, linux-kernel, torvalds, ast, bp, hpa,
	mingo, yoshfuji, tglx, linux-tip-commits, netdev
In-Reply-To: <tip-5f26c50143f58f256535bee8d93a105f36d4d2da@git.kernel.org>

Hi Ingo,

On 04/27/2018 01:00 PM, tip-bot for Ingo Molnar wrote:
> Commit-ID:  5f26c50143f58f256535bee8d93a105f36d4d2da
> Gitweb:     https://git.kernel.org/tip/5f26c50143f58f256535bee8d93a105f36d4d2da
> Author:     Ingo Molnar <mingo@kernel.org>
> AuthorDate: Fri, 27 Apr 2018 11:54:40 +0200
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Fri, 27 Apr 2018 12:42:04 +0200
> 
> x86/bpf: Clean up non-standard comments, to make the code more readable
> 
> So by chance I looked into x86 assembly in arch/x86/net/bpf_jit_comp.c and
> noticed the weird and inconsistent comment style it mistakenly learned from
> the networking code:
> 
>  /* Multi-line comment ...
>   * ... looks like this.
>   */
> 
> Fix this to use the standard comment style specified in Documentation/CodingStyle
> and used in arch/x86/ as well:
> 
>  /*
>   * Multi-line comment ...
>   * ... looks like this.
>   */
> 
> Also, to quote Linus's ... more explicit views about this:
> 
>   http://article.gmane.org/gmane.linux.kernel.cryptoapi/21066
> 
>   > But no, the networking code picked *none* of the above sane formats.
>   > Instead, it picked these two models that are just half-arsed
>   > shit-for-brains:
>   >
>   >  (no)
>   >      /* This is disgusting drug-induced
>   >        * crap, and should die
>   >        */
>   >
>   >   (no-no-no)
>   >       /* This is also very nasty
>   >        * and visually unbalanced */
>   >
>   > Please. The networking code actually has the *worst* possible comment
>   > style. You can literally find that (no-no-no) style, which is just
>   > really horribly disgusting and worse than the otherwise fairly similar
>   > (d) in pretty much every way.
> 
> Also improve the comments and some other details while at it:
> 
>  - Don't mix same-line and previous-line comment style on otherwise
>    identical code patterns within the same function,
> 
>  - capitalize 'BPF' and x86 register names consistently,
> 
>  - capitalize sentences consistently,
> 
>  - instead of 'x64' use 'x86-64': x64 is a Microsoft specific term,
> 
>  - use more consistent punctuation,
> 
>  - use standard coding style in macros as well,
> 
>  - fix typos and a few other minor details.
> 
> Consistent coding style is not optional, at least in arch/x86/.
> 
> No change in functionality.

Thanks for the cleanup, looks fine to me!

> ( In case this commit causes conflicts with pending development code
>   I'll be glad to help resolve any conflicts! )

Any objections if we would simply route this via bpf-next tree, otherwise
this will indeed cause really ugly merge conflicts throughout the JIT with
pending work.

> Acked-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Alexei Starovoitov <ast@fb.com>
> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Thanks,
Daniel

^ permalink raw reply

* Re: WARNING in tcp_enter_loss (2)
From: syzbot @ 2018-04-27 12:16 UTC (permalink / raw)
  To: davem, kuznet, linux-kernel, netdev, syzkaller-bugs, yoshfuji
In-Reply-To: <001a113f39820d16d50567379661@google.com>

syzbot has found reproducer for the following crash on upstream commit
0644f186fc9d77bb5bd198369e59fb28927a3692 (Thu Apr 26 23:36:11 2018 +0000)
Merge tag 'for_linus' of  
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=c5a3099b94cbdd9cd6da

So far this crash happened 2 times on net-next, upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5374384306913280
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=4821663019433984
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5119802469253120
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=7043958930931867332
compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c5a3099b94cbdd9cd6da@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed.

WARNING: CPU: 0 PID: 4456 at net/ipv4/tcp_input.c:1955  
tcp_enter_loss+0xe4f/0x1110 net/ipv4/tcp_input.c:1955
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 4456 Comm: syz-executor694 Not tainted 4.17.0-rc2+ #19
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  panic+0x22f/0x4de kernel/panic.c:184
  __warn.cold.8+0x163/0x1b3 kernel/panic.c:536
  report_bug+0x252/0x2d0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:178 [inline]
  do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:tcp_enter_loss+0xe4f/0x1110 net/ipv4/tcp_input.c:1955
RSP: 0018:ffff8801b66c7560 EFLAGS: 00010293
RAX: ffff8801b66686c0 RBX: 0000000000000001 RCX: ffffffff864ac155
RDX: 0000000000000000 RSI: ffffffff864ac5bf RDI: 0000000000000004
RBP: ffff8801b66c75e0 R08: ffff8801b66686c0 R09: 0000000000000000
R10: ffffed0043fff001 R11: ffff88021fff8017 R12: 0000000000000003
R13: 0000000000000002 R14: ffff8801c8c6dd30 R15: ffff8801d02e5500
WARNING: CPU: 1 PID: 4450 at net/ipv4/tcp_input.c:1955  
tcp_enter_loss+0xe4f/0x1110 net/ipv4/tcp_input.c:1955
  tcp_retransmit_timer+0xc34/0x3060 net/ipv4/tcp_timer.c:486
Modules linked in:
CPU: 1 PID: 4450 Comm: syz-executor694 Not tainted 4.17.0-rc2+ #19
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:tcp_enter_loss+0xe4f/0x1110 net/ipv4/tcp_input.c:1955
RSP: 0018:ffff8801b60b7560 EFLAGS: 00010293
RAX: ffff8801b662e500 RBX: 0000000000000001 RCX: ffffffff864ac155
RDX: 0000000000000000 RSI: ffffffff864ac5bf RDI: 0000000000000004
RBP: ffff8801b60b75e0 R08: ffff8801b662e500 R09: 0000000000000000
R10: ffffed0043fff009 R11: ffff88021fff8057 R12: 0000000000000003
R13: 0000000000000002 R14: ffff8801cc3cf870 R15: ffff8801cd4f0a80
FS:  00000000015e1880(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000021000000 CR3: 00000001b631c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  tcp_write_timer_handler+0x339/0x960 net/ipv4/tcp_timer.c:573
  tcp_retransmit_timer+0xc34/0x3060 net/ipv4/tcp_timer.c:486
  tcp_release_cb+0x25e/0x2d0 net/ipv4/tcp_output.c:871
  release_sock+0x107/0x2b0 net/core/sock.c:2856
  do_tcp_setsockopt.isra.38+0x48e/0x2600 net/ipv4/tcp.c:2880
  tcp_write_timer_handler+0x339/0x960 net/ipv4/tcp_timer.c:573
  tcp_setsockopt+0xc1/0xe0 net/ipv4/tcp.c:2892
  sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3039
  tcp_release_cb+0x25e/0x2d0 net/ipv4/tcp_output.c:871
  __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
  release_sock+0x107/0x2b0 net/core/sock.c:2856
  __do_sys_setsockopt net/socket.c:1914 [inline]
  __se_sys_setsockopt net/socket.c:1911 [inline]
  __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  do_tcp_setsockopt.isra.38+0x48e/0x2600 net/ipv4/tcp.c:2880
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441bc9
RSP: 002b:00007ffe202bc838 EFLAGS: 00000207
  ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000441bc9
RDX: 0000000000000016 RSI: 0000000000000006 RDI: 0000000000000003
RBP: 00000000006cd018 R08: 000000002000023b R09: 0000000000000010
  tcp_setsockopt+0xc1/0xe0 net/ipv4/tcp.c:2892
R10: 0000000020000040 R11: 0000000000000207 R12: 0000000000402810
  sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3039
R13: 00000000004028a0 R14: 0000000000000000 R15: 0000000000000000
  __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
  __do_sys_setsockopt net/socket.c:1914 [inline]
  __se_sys_setsockopt net/socket.c:1911 [inline]
  __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441bc9
RSP: 002b:00007ffe202bc838 EFLAGS: 00000207 ORIG_RAX: 0000000000000036
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000441bc9
RDX: 0000000000000016 RSI: 0000000000000006 RDI: 0000000000000003
RBP: 00000000006cd018 R08: 000000002000023b R09: 0000000000000010
R10: 0000000020000040 R11: 0000000000000207 R12: 0000000000402810
R13: 00000000004028a0 R14: 0000000000000000 R15: 0000000000000000
Code: 89 a7 38 08 00 00 e9 07 fc ff ff 49 8d 87 78 09 00 00 48 89 45 88 49  
8d 87 68 07 00 00 48 89 45 d0 e9 c5 f2 ff ff e8 91 6a 2e fb <0f> 0b e9 98  
fb ff ff e8 55 cb 6a fb e9 de f6 ff ff 48 8b 7d d0
irq event stamp: 76541
hardirqs last  enabled at (76539): [<ffffffff878009d5>]  
restore_regs_and_return_to_kernel+0x0/0x2b
hardirqs last disabled at (76541): [<ffffffff87801166>]  
error_entry+0x76/0xd0 arch/x86/entry/entry_64.S:1262
softirqs last  enabled at (76528): [<ffffffff87a00778>]  
__do_softirq+0x778/0xaf5 kernel/softirq.c:311
softirqs last disabled at (76540): [<ffffffff85d60074>] spin_lock_bh  
include/linux/spinlock.h:315 [inline]
softirqs last disabled at (76540): [<ffffffff85d60074>]  
release_sock+0x74/0x2b0 net/core/sock.c:2848
---[ end trace a7562162d42a707b ]---
Dumping ftrace buffer:
    (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..

^ permalink raw reply

* [PATCH net-next v4] Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-04-27 12:17 UTC (permalink / raw)
  To: netdev; +Cc: cake, Toke Høiland-Jørgensen, Dave Taht

sch_cake targets the home router use case and is intended to squeeze the
most bandwidth and latency out of even the slowest ISP links and routers,
while presenting an API simple enough that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash

CAKE is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* An deficit based shaper, that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Extensive support for DSL framing types.
* Support for ack filtering.
* Extensive statistics for measuring, loss, ecn markings, latency
  variation.

A paper describing the design of CAKE is available at
https://arxiv.org/abs/1804.07617

Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

CAKE's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the cake@lists.bufferbloat.net mailing list.

tc -s qdisc show dev eth2
qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms raw overhead 0
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 memory used: 0b of 5000000b
 capacity estimate: 100Mbit
 min/max network layer size:        65535 /       0
 min/max overhead-adjusted size:    65535 /       0
 average network hdr offset:            0

                   Bulk  Best Effort        Voice
  thresh       6250Kbit      100Mbit       25Mbit
  target          5.0ms        5.0ms        5.0ms
  interval      100.0ms      100.0ms      100.0ms
  pk_delay          0us          0us          0us
  av_delay          0us          0us          0us
  sp_delay          0us          0us          0us
  pkts                0            0            0
  bytes               0            0            0
  way_inds            0            0            0
  way_miss            0            0            0
  way_cols            0            0            0
  drops               0            0            0
  marks               0            0            0
  ack_drop            0            0            0
  sp_flows            0            0            0
  bk_flows            0            0            0
  un_flows            0            0            0
  max_len             0            0            0
  quantum           300         1514          762

Tested-by: Pete Heist <peteheist@gmail.com>
Tested-by: Georgios Amanakis <gamanakis@gmail.com>
Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
Changelog:
v4:
  - Only split GSO packets if shaping at speeds <= 1Gbps
  - Fix overhead calculation code to also work for GSO packets
  - Don't re-implement kvzalloc()
  - Remove local header include from out-of-tree build (fixes kbuild-bot
    complaint).
  - Several fixes to the ACK filter:
    - Check pskb_may_pull() before deref of transport headers.
    - Don't run ACK filter logic on split GSO packets
    - Fix TCP sequence number compare to deal with wraparounds

v3:
  - Use IS_REACHABLE() macro to fix compilation when sch_cake is
    built-in and conntrack is a module.
  - Switch the stats output to use nested netlink attributes instead
    of a versioned struct.
  - Remove GPL boilerplate.
  - Fix array initialisation style.

v2:
  - Fix kbuild test bot complaint
  - Clean up the netlink ABI
  - Fix checkpatch complaints
  - A few tweaks to the behaviour of cake based on testing carried out
    while writing the paper.

 include/uapi/linux/pkt_sched.h |  105 ++
 net/sched/Kconfig              |   11 +
 net/sched/Makefile             |    1 +
 net/sched/sch_cake.c           | 2641 ++++++++++++++++++++++++++++++++
 4 files changed, 2758 insertions(+)
 create mode 100644 net/sched/sch_cake.c

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 37b5096ae97b..bc581473c0b0 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -934,4 +934,109 @@ enum {
 
 #define TCA_CBS_MAX (__TCA_CBS_MAX - 1)
 
+/* CAKE */
+enum {
+	TCA_CAKE_UNSPEC,
+	TCA_CAKE_BASE_RATE,
+	TCA_CAKE_DIFFSERV_MODE,
+	TCA_CAKE_ATM,
+	TCA_CAKE_FLOW_MODE,
+	TCA_CAKE_OVERHEAD,
+	TCA_CAKE_RTT,
+	TCA_CAKE_TARGET,
+	TCA_CAKE_AUTORATE,
+	TCA_CAKE_MEMORY,
+	TCA_CAKE_NAT,
+	TCA_CAKE_RAW,
+	TCA_CAKE_WASH,
+	TCA_CAKE_MPU,
+	TCA_CAKE_INGRESS,
+	TCA_CAKE_ACK_FILTER,
+	TCA_CAKE_SPLIT_GSO,
+	__TCA_CAKE_MAX
+};
+#define TCA_CAKE_MAX	(__TCA_CAKE_MAX - 1)
+
+enum {
+	__TCA_CAKE_STATS_INVALID,
+	TCA_CAKE_STATS_CAPACITY_ESTIMATE,
+	TCA_CAKE_STATS_MEMORY_LIMIT,
+	TCA_CAKE_STATS_MEMORY_USED,
+	TCA_CAKE_STATS_AVG_NETOFF,
+	TCA_CAKE_STATS_MIN_NETLEN,
+	TCA_CAKE_STATS_MAX_NETLEN,
+	TCA_CAKE_STATS_MIN_ADJLEN,
+	TCA_CAKE_STATS_MAX_ADJLEN,
+	TCA_CAKE_STATS_TIN_STATS,
+	__TCA_CAKE_STATS_MAX
+};
+#define TCA_CAKE_STATS_MAX (__TCA_CAKE_STATS_MAX - 1)
+
+enum {
+	__TCA_CAKE_TIN_STATS_INVALID,
+	TCA_CAKE_TIN_STATS_PAD,
+	TCA_CAKE_TIN_STATS_SENT_PACKETS,
+	TCA_CAKE_TIN_STATS_SENT_BYTES64,
+	TCA_CAKE_TIN_STATS_DROPPED_PACKETS,
+	TCA_CAKE_TIN_STATS_DROPPED_BYTES64,
+	TCA_CAKE_TIN_STATS_ACKS_DROPPED_PACKETS,
+	TCA_CAKE_TIN_STATS_ACKS_DROPPED_BYTES64,
+	TCA_CAKE_TIN_STATS_ECN_MARKED_PACKETS,
+	TCA_CAKE_TIN_STATS_ECN_MARKED_BYTES64,
+	TCA_CAKE_TIN_STATS_BACKLOG_PACKETS,
+	TCA_CAKE_TIN_STATS_BACKLOG_BYTES64,
+	TCA_CAKE_TIN_STATS_THRESHOLD_RATE,
+	TCA_CAKE_TIN_STATS_TARGET_US,
+	TCA_CAKE_TIN_STATS_INTERVAL_US,
+	TCA_CAKE_TIN_STATS_WAY_INDIRECT_HITS,
+	TCA_CAKE_TIN_STATS_WAY_MISSES,
+	TCA_CAKE_TIN_STATS_WAY_COLLISIONS,
+	TCA_CAKE_TIN_STATS_PEAK_DELAY_US,
+	TCA_CAKE_TIN_STATS_AVG_DELAY_US,
+	TCA_CAKE_TIN_STATS_BASE_DELAY_US,
+	TCA_CAKE_TIN_STATS_SPARSE_FLOWS,
+	TCA_CAKE_TIN_STATS_BULK_FLOWS,
+	TCA_CAKE_TIN_STATS_UNRESPONSIVE_FLOWS,
+	TCA_CAKE_TIN_STATS_MAX_SKBLEN,
+	TCA_CAKE_TIN_STATS_FLOW_QUANTUM,
+	__TCA_CAKE_TIN_STATS_MAX
+};
+#define TCA_CAKE_TIN_STATS_MAX (__TCA_CAKE_TIN_STATS_MAX - 1)
+#define TC_CAKE_MAX_TINS (8)
+
+enum {
+	CAKE_FLOW_NONE = 0,
+	CAKE_FLOW_SRC_IP,
+	CAKE_FLOW_DST_IP,
+	CAKE_FLOW_HOSTS,    /* = CAKE_FLOW_SRC_IP | CAKE_FLOW_DST_IP */
+	CAKE_FLOW_FLOWS,
+	CAKE_FLOW_DUAL_SRC, /* = CAKE_FLOW_SRC_IP | CAKE_FLOW_FLOWS */
+	CAKE_FLOW_DUAL_DST, /* = CAKE_FLOW_DST_IP | CAKE_FLOW_FLOWS */
+	CAKE_FLOW_TRIPLE,   /* = CAKE_FLOW_HOSTS  | CAKE_FLOW_FLOWS */
+	CAKE_FLOW_MAX,
+};
+
+enum {
+	CAKE_DIFFSERV_DIFFSERV3 = 0,
+	CAKE_DIFFSERV_DIFFSERV4,
+	CAKE_DIFFSERV_DIFFSERV8,
+	CAKE_DIFFSERV_BESTEFFORT,
+	CAKE_DIFFSERV_PRECEDENCE,
+	CAKE_DIFFSERV_MAX
+};
+
+enum {
+	CAKE_ACK_NONE = 0,
+	CAKE_ACK_FILTER,
+	CAKE_ACK_AGGRESSIVE,
+	CAKE_ACK_MAX
+};
+
+enum {
+	CAKE_ATM_NONE = 0,
+	CAKE_ATM_ATM,
+	CAKE_ATM_PTM,
+	CAKE_ATM_MAX
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a01169fb5325..6e7d614b5757 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -284,6 +284,17 @@ config NET_SCH_FQ_CODEL
 
 	  If unsure, say N.
 
+config NET_SCH_CAKE
+	tristate "Common Applications Kept Enhanced (CAKE)"
+	help
+	  Say Y here if you want to use the Common Applications Kept Enhanced
+          (CAKE) queue management algorithm.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called sch_cake.
+
+	  If unsure, say N.
+
 config NET_SCH_FQ
 	tristate "Fair Queue"
 	help
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8811d3804878..435054cee32c 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_NET_SCH_CHOKE)	+= sch_choke.o
 obj-$(CONFIG_NET_SCH_QFQ)	+= sch_qfq.o
 obj-$(CONFIG_NET_SCH_CODEL)	+= sch_codel.o
 obj-$(CONFIG_NET_SCH_FQ_CODEL)	+= sch_fq_codel.o
+obj-$(CONFIG_NET_SCH_CAKE)	+= sch_cake.o
 obj-$(CONFIG_NET_SCH_FQ)	+= sch_fq.o
 obj-$(CONFIG_NET_SCH_HHF)	+= sch_hhf.o
 obj-$(CONFIG_NET_SCH_PIE)	+= sch_pie.o
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
new file mode 100644
index 000000000000..52af01a3f5bc
--- /dev/null
+++ b/net/sched/sch_cake.c
@@ -0,0 +1,2641 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+/* COMMON Applications Kept Enhanced (CAKE) discipline
+ *
+ * Copyright (C) 2014-2018 Jonathan Morton <chromatix99@gmail.com>
+ * Copyright (C) 2015-2018 Toke Høiland-Jørgensen <toke@toke.dk>
+ * Copyright (C) 2014-2018 Dave Täht <dave.taht@gmail.com>
+ * Copyright (C) 2015-2018 Sebastian Moeller <moeller0@gmx.de>
+ * (C) 2015-2018 Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk>
+ * Copyright (C) 2017 Ryan Mounce <ryan@mounce.com.au>
+ *
+ * The CAKE Principles:
+ *		   (or, how to have your cake and eat it too)
+ *
+ * This is a combination of several shaping, AQM and FQ techniques into one
+ * easy-to-use package:
+ *
+ * - An overall bandwidth shaper, to move the bottleneck away from dumb CPE
+ *   equipment and bloated MACs.  This operates in deficit mode (as in sch_fq),
+ *   eliminating the need for any sort of burst parameter (eg. token bucket
+ *   depth).  Burst support is limited to that necessary to overcome scheduling
+ *   latency.
+ *
+ * - A Diffserv-aware priority queue, giving more priority to certain classes,
+ *   up to a specified fraction of bandwidth.  Above that bandwidth threshold,
+ *   the priority is reduced to avoid starving other tins.
+ *
+ * - Each priority tin has a separate Flow Queue system, to isolate traffic
+ *   flows from each other.  This prevents a burst on one flow from increasing
+ *   the delay to another.  Flows are distributed to queues using a
+ *   set-associative hash function.
+ *
+ * - Each queue is actively managed by Cobalt, which is a combination of the
+ *   Codel and Blue AQM algorithms.  This serves flows fairly, and signals
+ *   congestion early via ECN (if available) and/or packet drops, to keep
+ *   latency low.  The codel parameters are auto-tuned based on the bandwidth
+ *   setting, as is necessary at low bandwidths.
+ *
+ * The configuration parameters are kept deliberately simple for ease of use.
+ * Everything has sane defaults.  Complete generality of configuration is *not*
+ * a goal.
+ *
+ * The priority queue operates according to a weighted DRR scheme, combined with
+ * a bandwidth tracker which reuses the shaper logic to detect which side of the
+ * bandwidth sharing threshold the tin is operating.  This determines whether a
+ * priority-based weight (high) or a bandwidth-based weight (low) is used for
+ * that tin in the current pass.
+ *
+ * This qdisc was inspired by Eric Dumazet's fq_codel code, which he kindly
+ * granted us permission to leverage.
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/string.h>
+#include <linux/in.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/skbuff.h>
+#include <linux/jhash.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/reciprocal_div.h>
+#include <net/netlink.h>
+#include <linux/version.h>
+#include <linux/if_vlan.h>
+#include <net/pkt_sched.h>
+#include <net/tcp.h>
+#include <net/flow_dissector.h>
+
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
+#include <net/netfilter/nf_conntrack.h>
+#endif
+
+#define CAKE_SET_WAYS (8)
+#define CAKE_MAX_TINS (8)
+#define CAKE_QUEUES (1024)
+#define CAKE_SPLIT_GSO_THRESHOLD (125000000) /* 1Gbps */
+#define US2TIME(a) (a * (u64)NSEC_PER_USEC)
+
+typedef u64 cobalt_time_t;
+typedef s64 cobalt_tdiff_t;
+
+/**
+ * struct cobalt_params - contains codel and blue parameters
+ * @interval:	codel initial drop rate
+ * @target:     maximum persistent sojourn time & blue update rate
+ * @mtu_time:   serialisation delay of maximum-size packet
+ * @p_inc:      increment of blue drop probability (0.32 fxp)
+ * @p_dec:      decrement of blue drop probability (0.32 fxp)
+ */
+struct cobalt_params {
+	cobalt_time_t	interval;
+	cobalt_time_t	target;
+	cobalt_time_t	mtu_time;
+	u32		p_inc;
+	u32		p_dec;
+};
+
+/* struct cobalt_vars - contains codel and blue variables
+ * @count:	  codel dropping frequency
+ * @rec_inv_sqrt: reciprocal value of sqrt(count) >> 1
+ * @drop_next:    time to drop next packet, or when we dropped last
+ * @blue_timer:	  Blue time to next drop
+ * @p_drop:       BLUE drop probability (0.32 fxp)
+ * @dropping:     set if in dropping state
+ * @ecn_marked:   set if marked
+ */
+struct cobalt_vars {
+	u32		count;
+	u32		rec_inv_sqrt;
+	cobalt_time_t	drop_next;
+	cobalt_time_t	blue_timer;
+	u32     p_drop;
+	bool	dropping;
+	bool    ecn_marked;
+};
+
+enum {
+	CAKE_SET_NONE = 0,
+	CAKE_SET_SPARSE,
+	CAKE_SET_SPARSE_WAIT, /* counted in SPARSE, actually in BULK */
+	CAKE_SET_BULK,
+	CAKE_SET_DECAYING
+};
+
+struct cake_flow {
+	/* this stuff is all needed per-flow at dequeue time */
+	struct sk_buff	  *head;
+	struct sk_buff	  *tail;
+	struct sk_buff	  *ackcheck;
+	struct list_head  flowchain;
+	s32		  deficit;
+	struct cobalt_vars cvars;
+	u16		  srchost; /* index into cake_host table */
+	u16		  dsthost;
+	u8		  set;
+}; /* please try to keep this structure <= 64 bytes */
+
+struct cake_host {
+	u32 srchost_tag;
+	u32 dsthost_tag;
+	u16 srchost_refcnt;
+	u16 dsthost_refcnt;
+};
+
+struct cake_heap_entry {
+	u16 t:3, b:10;
+};
+
+struct cake_tin_data {
+	struct cake_flow flows[CAKE_QUEUES];
+	u32	backlogs[CAKE_QUEUES];
+	u32	tags[CAKE_QUEUES]; /* for set association */
+	u16	overflow_idx[CAKE_QUEUES];
+	struct cake_host hosts[CAKE_QUEUES]; /* for triple isolation */
+	u32	perturb;
+	u16	flow_quantum;
+
+	struct cobalt_params cparams;
+	u32	drop_overlimit;
+	u16	bulk_flow_count;
+	u16	sparse_flow_count;
+	u16	decaying_flow_count;
+	u16	unresponsive_flow_count;
+
+	u32	max_skblen;
+
+	struct list_head new_flows;
+	struct list_head old_flows;
+	struct list_head decaying_flows;
+
+	/* time_next = time_this + ((len * rate_ns) >> rate_shft) */
+	u64	tin_time_next_packet;
+	u32	tin_rate_ns;
+	u32	tin_rate_bps;
+	u16	tin_rate_shft;
+
+	u16	tin_quantum_prio;
+	u16	tin_quantum_band;
+	s32	tin_deficit;
+	u32	tin_backlog;
+	u32	tin_dropped;
+	u32	tin_ecn_mark;
+
+	u32	packets;
+	u64	bytes;
+
+	u32	ack_drops;
+
+	/* moving averages */
+	cobalt_time_t avge_delay;
+	cobalt_time_t peak_delay;
+	cobalt_time_t base_delay;
+
+	/* hash function stats */
+	u32	way_directs;
+	u32	way_hits;
+	u32	way_misses;
+	u32	way_collisions;
+}; /* number of tins is small, so size of this struct doesn't matter much */
+
+struct cake_sched_data {
+	struct cake_tin_data *tins;
+
+	struct cake_heap_entry overflow_heap[CAKE_QUEUES * CAKE_MAX_TINS];
+	u16		overflow_timeout;
+
+	u16		tin_cnt;
+	u8		tin_mode;
+#define	CAKE_FLOW_NAT_FLAG 64
+	u8		flow_mode;
+	u8		ack_filter;
+	u8		atm_mode;
+
+	/* time_next = time_this + ((len * rate_ns) >> rate_shft) */
+	u16		rate_shft;
+	u64		time_next_packet;
+	u64		failsafe_next_packet;
+	u32		rate_ns;
+	u32		rate_bps;
+	u16		rate_flags;
+	s16		rate_overhead;
+	u16		rate_mpu;
+	u32		interval;
+	u32		target;
+
+	/* resource tracking */
+	u32		buffer_used;
+	u32		buffer_max_used;
+	u32		buffer_limit;
+	u32		buffer_config_limit;
+
+	/* indices for dequeue */
+	u16		cur_tin;
+	u16		cur_flow;
+
+	struct qdisc_watchdog watchdog;
+	const u8	*tin_index;
+	const u8	*tin_order;
+
+	/* bandwidth capacity estimate */
+	u64		last_packet_time;
+	u64		avg_packet_interval;
+	u64		avg_window_begin;
+	u32		avg_window_bytes;
+	u32		avg_peak_bandwidth;
+	u64		last_reconfig_time;
+
+	/* packet length stats */
+	u32 avg_netoff;
+	u16 max_netlen;
+	u16 max_adjlen;
+	u16 min_netlen;
+	u16 min_adjlen;
+};
+
+enum {
+	CAKE_FLAG_OVERHEAD	   = BIT(0),
+	CAKE_FLAG_AUTORATE_INGRESS = BIT(1),
+	CAKE_FLAG_INGRESS	   = BIT(2),
+	CAKE_FLAG_WASH		   = BIT(3),
+	CAKE_FLAG_SPLIT_GSO	   = BIT(4)
+};
+
+/* COBALT operates the Codel and BLUE algorithms in parallel, in order to
+ * obtain the best features of each.  Codel is excellent on flows which
+ * respond to congestion signals in a TCP-like way.  BLUE is more effective on
+ * unresponsive flows.
+ */
+
+struct cobalt_skb_cb {
+	cobalt_time_t enqueue_time;
+	u32           adjusted_len;
+};
+
+static inline cobalt_time_t cobalt_get_time(void)
+{
+	return ktime_get_ns();
+}
+
+static inline u32 cobalt_time_to_us(cobalt_time_t val)
+{
+	do_div(val, NSEC_PER_USEC);
+	return (u32)val;
+}
+
+static inline struct cobalt_skb_cb *get_cobalt_cb(const struct sk_buff *skb)
+{
+	qdisc_cb_private_validate(skb, sizeof(struct cobalt_skb_cb));
+	return (struct cobalt_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static inline cobalt_time_t cobalt_get_enqueue_time(const struct sk_buff *skb)
+{
+	return get_cobalt_cb(skb)->enqueue_time;
+}
+
+static inline void cobalt_set_enqueue_time(struct sk_buff *skb,
+					   cobalt_time_t now)
+{
+	get_cobalt_cb(skb)->enqueue_time = now;
+}
+
+static u16 quantum_div[CAKE_QUEUES + 1] = {0};
+
+/* Diffserv lookup tables */
+
+static const u8 precedence[] = {
+	0, 0, 0, 0, 0, 0, 0, 0,
+	1, 1, 1, 1, 1, 1, 1, 1,
+	2, 2, 2, 2, 2, 2, 2, 2,
+	3, 3, 3, 3, 3, 3, 3, 3,
+	4, 4, 4, 4, 4, 4, 4, 4,
+	5, 5, 5, 5, 5, 5, 5, 5,
+	6, 6, 6, 6, 6, 6, 6, 6,
+	7, 7, 7, 7, 7, 7, 7, 7,
+};
+
+static const u8 diffserv8[] = {
+	2, 5, 1, 2, 4, 2, 2, 2,
+	0, 2, 1, 2, 1, 2, 1, 2,
+	5, 2, 4, 2, 4, 2, 4, 2,
+	3, 2, 3, 2, 3, 2, 3, 2,
+	6, 2, 3, 2, 3, 2, 3, 2,
+	6, 2, 2, 2, 6, 2, 6, 2,
+	7, 2, 2, 2, 2, 2, 2, 2,
+	7, 2, 2, 2, 2, 2, 2, 2,
+};
+
+static const u8 diffserv4[] = {
+	0, 2, 0, 0, 2, 0, 0, 0,
+	1, 0, 0, 0, 0, 0, 0, 0,
+	2, 0, 2, 0, 2, 0, 2, 0,
+	2, 0, 2, 0, 2, 0, 2, 0,
+	3, 0, 2, 0, 2, 0, 2, 0,
+	3, 0, 0, 0, 3, 0, 3, 0,
+	3, 0, 0, 0, 0, 0, 0, 0,
+	3, 0, 0, 0, 0, 0, 0, 0,
+};
+
+static const u8 diffserv3[] = {
+	0, 0, 0, 0, 2, 0, 0, 0,
+	1, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 2, 0, 2, 0,
+	2, 0, 0, 0, 0, 0, 0, 0,
+	2, 0, 0, 0, 0, 0, 0, 0,
+};
+
+static const u8 besteffort[] = {
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+};
+
+/* tin priority order for stats dumping */
+
+static const u8 normal_order[] = {0, 1, 2, 3, 4, 5, 6, 7};
+static const u8 bulk_order[] = {1, 0, 2, 3};
+
+#define REC_INV_SQRT_CACHE (16)
+static u32 cobalt_rec_inv_sqrt_cache[REC_INV_SQRT_CACHE] = {0};
+
+/* http://en.wikipedia.org/wiki/Methods_of_computing_square_roots
+ * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
+ *
+ * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
+ */
+
+static void cobalt_newton_step(struct cobalt_vars *vars)
+{
+	u32 invsqrt = vars->rec_inv_sqrt;
+	u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 32;
+	u64 val = (3LL << 32) - ((u64)vars->count * invsqrt2);
+
+	val >>= 2; /* avoid overflow in following multiply */
+	val = (val * invsqrt) >> (32 - 2 + 1);
+
+	vars->rec_inv_sqrt = val;
+}
+
+static void cobalt_invsqrt(struct cobalt_vars *vars)
+{
+	if (vars->count < REC_INV_SQRT_CACHE)
+		vars->rec_inv_sqrt = cobalt_rec_inv_sqrt_cache[vars->count];
+	else
+		cobalt_newton_step(vars);
+}
+
+/* There is a big difference in timing between the accurate values placed in
+ * the cache and the approximations given by a single Newton step for small
+ * count values, particularly when stepping from count 1 to 2 or vice versa.
+ * Above 16, a single Newton step gives sufficient accuracy in either
+ * direction, given the precision stored.
+ *
+ * The magnitude of the error when stepping up to count 2 is such as to give
+ * the value that *should* have been produced at count 4.
+ */
+
+static void cobalt_cache_init(void)
+{
+	struct cobalt_vars v;
+
+	memset(&v, 0, sizeof(v));
+	v.rec_inv_sqrt = ~0U;
+	cobalt_rec_inv_sqrt_cache[0] = v.rec_inv_sqrt;
+
+	for (v.count = 1; v.count < REC_INV_SQRT_CACHE; v.count++) {
+		cobalt_newton_step(&v);
+		cobalt_newton_step(&v);
+		cobalt_newton_step(&v);
+		cobalt_newton_step(&v);
+
+		cobalt_rec_inv_sqrt_cache[v.count] = v.rec_inv_sqrt;
+	}
+}
+
+static void cobalt_vars_init(struct cobalt_vars *vars)
+{
+	memset(vars, 0, sizeof(*vars));
+
+	if (!cobalt_rec_inv_sqrt_cache[0]) {
+		cobalt_cache_init();
+		cobalt_rec_inv_sqrt_cache[0] = ~0;
+	}
+}
+
+/* CoDel control_law is t + interval/sqrt(count)
+ * We maintain in rec_inv_sqrt the reciprocal value of sqrt(count) to avoid
+ * both sqrt() and divide operation.
+ */
+static cobalt_time_t cobalt_control(cobalt_time_t t,
+				    cobalt_time_t interval,
+				    u32 rec_inv_sqrt)
+{
+	return t + reciprocal_scale(interval, rec_inv_sqrt);
+}
+
+/* Call this when a packet had to be dropped due to queue overflow.  Returns
+ * true if the BLUE state was quiescent before but active after this call.
+ */
+static bool cobalt_queue_full(struct cobalt_vars *vars,
+			      struct cobalt_params *p,
+			      cobalt_time_t now)
+{
+	bool up = false;
+
+	if ((now - vars->blue_timer) > p->target) {
+		up = !vars->p_drop;
+		vars->p_drop += p->p_inc;
+		if (vars->p_drop < p->p_inc)
+			vars->p_drop = ~0;
+		vars->blue_timer = now;
+	}
+	vars->dropping = true;
+	vars->drop_next = now;
+	if (!vars->count)
+		vars->count = 1;
+
+	return up;
+}
+
+/* Call this when the queue was serviced but turned out to be empty.  Returns
+ * true if the BLUE state was active before but quiescent after this call.
+ */
+static bool cobalt_queue_empty(struct cobalt_vars *vars,
+			       struct cobalt_params *p,
+			       cobalt_time_t now)
+{
+	bool down = false;
+
+	if (vars->p_drop && (now - vars->blue_timer) > p->target) {
+		if (vars->p_drop < p->p_dec)
+			vars->p_drop = 0;
+		else
+			vars->p_drop -= p->p_dec;
+		vars->blue_timer = now;
+		down = !vars->p_drop;
+	}
+	vars->dropping = false;
+
+	if (vars->count && (now - vars->drop_next) >= 0) {
+		vars->count--;
+		cobalt_invsqrt(vars);
+		vars->drop_next = cobalt_control(vars->drop_next,
+						 p->interval,
+						 vars->rec_inv_sqrt);
+	}
+
+	return down;
+}
+
+/* Call this with a freshly dequeued packet for possible congestion marking.
+ * Returns true as an instruction to drop the packet, false for delivery.
+ */
+static bool cobalt_should_drop(struct cobalt_vars *vars,
+			       struct cobalt_params *p,
+			       cobalt_time_t now,
+			       struct sk_buff *skb,
+			       u32 bulk_flows)
+{
+	bool drop = false;
+
+	/* Simplified Codel implementation */
+	cobalt_tdiff_t sojourn  = now - cobalt_get_enqueue_time(skb);
+
+/* The 'schedule' variable records, in its sign, whether 'now' is before or
+ * after 'drop_next'.  This allows 'drop_next' to be updated before the next
+ * scheduling decision is actually branched, without destroying that
+ * information.  Similarly, the first 'schedule' value calculated is preserved
+ * in the boolean 'next_due'.
+ *
+ * As for 'drop_next', we take advantage of the fact that 'interval' is both
+ * the delay between first exceeding 'target' and the first signalling event,
+ * *and* the scaling factor for the signalling frequency.  It's therefore very
+ * natural to use a single mechanism for both purposes, and eliminates a
+ * significant amount of reference Codel's spaghetti code.  To help with this,
+ * both the '0' and '1' entries in the invsqrt cache are 0xFFFFFFFF, as close
+ * as possible to 1.0 in fixed-point.
+ */
+
+	cobalt_tdiff_t schedule = now - vars->drop_next;
+
+	bool over_target = sojourn > p->target &&
+			   sojourn > p->mtu_time * bulk_flows * 2 &&
+			   sojourn > p->mtu_time * 4;
+	bool next_due    = vars->count && schedule >= 0;
+
+	vars->ecn_marked = false;
+
+	if (over_target) {
+		if (!vars->dropping) {
+			vars->dropping = true;
+			vars->drop_next = cobalt_control(now,
+							 p->interval,
+							 vars->rec_inv_sqrt);
+		}
+		if (!vars->count)
+			vars->count = 1;
+	} else if (vars->dropping) {
+		vars->dropping = false;
+	}
+
+	if (next_due && vars->dropping) {
+		/* Use ECN mark if possible, otherwise drop */
+		drop = !(vars->ecn_marked = INET_ECN_set_ce(skb));
+
+		vars->count++;
+		if (!vars->count)
+			vars->count--;
+		cobalt_invsqrt(vars);
+		vars->drop_next = cobalt_control(vars->drop_next,
+						 p->interval,
+						 vars->rec_inv_sqrt);
+		schedule = now - vars->drop_next;
+	} else {
+		while (next_due) {
+			vars->count--;
+			cobalt_invsqrt(vars);
+			vars->drop_next = cobalt_control(vars->drop_next,
+							 p->interval,
+							 vars->rec_inv_sqrt);
+			schedule = now - vars->drop_next;
+			next_due = vars->count && schedule >= 0;
+		}
+	}
+
+	/* Simple BLUE implementation.  Lack of ECN is deliberate. */
+	if (vars->p_drop)
+		drop |= (prandom_u32() < vars->p_drop);
+
+	/* Overload the drop_next field as an activity timeout */
+	if (!vars->count)
+		vars->drop_next = now + p->interval;
+	else if (schedule > 0 && !drop)
+		vars->drop_next = now;
+
+	return drop;
+}
+
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+
+static inline void cake_update_flowkeys(struct flow_keys *keys,
+					const struct sk_buff *skb)
+{
+	enum ip_conntrack_info ctinfo;
+	bool rev = false;
+
+	struct nf_conn *ct;
+	const struct nf_conntrack_tuple *tuple;
+
+	if (tc_skb_protocol(skb) != htons(ETH_P_IP))
+		return;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (ct) {
+		tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
+	} else {
+		const struct nf_conntrack_tuple_hash *hash;
+		struct nf_conntrack_tuple srctuple;
+
+		if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
+				       NFPROTO_IPV4, dev_net(skb->dev),
+				       &srctuple))
+			return;
+
+		hash = nf_conntrack_find_get(dev_net(skb->dev),
+					     &nf_ct_zone_dflt,
+					     &srctuple);
+		if (!hash)
+			return;
+
+		rev = true;
+		ct = nf_ct_tuplehash_to_ctrack(hash);
+		tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
+	}
+
+	keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
+	keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
+
+	if (keys->ports.ports) {
+		keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
+		keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
+	}
+	if (rev)
+		nf_ct_put(ct);
+}
+#else
+static inline void cake_update_flowkeys(struct flow_keys *keys,
+					const struct sk_buff *skb)
+{
+	/* There is nothing we can do here without CONNTRACK */
+}
+#endif
+
+/* Cake has several subtle multiple bit settings. In these cases you
+ *  would be matching triple isolate mode as well.
+ */
+
+static inline bool cake_dsrc(int flow_mode)
+{
+	return (flow_mode & CAKE_FLOW_DUAL_SRC) == CAKE_FLOW_DUAL_SRC;
+}
+
+static inline bool cake_ddst(int flow_mode)
+{
+	return (flow_mode & CAKE_FLOW_DUAL_DST) == CAKE_FLOW_DUAL_DST;
+}
+
+static inline u32
+cake_hash(struct cake_tin_data *q, const struct sk_buff *skb, int flow_mode)
+{
+	struct flow_keys keys, host_keys;
+	u32 flow_hash = 0, srchost_hash, dsthost_hash;
+	u16 reduced_hash, srchost_idx, dsthost_idx;
+
+	if (unlikely(flow_mode == CAKE_FLOW_NONE))
+		return 0;
+
+	skb_flow_dissect_flow_keys(skb, &keys,
+				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
+
+	if (flow_mode & CAKE_FLOW_NAT_FLAG)
+		cake_update_flowkeys(&keys, skb);
+
+	/* flow_hash_from_keys() sorts the addresses by value, so we have
+	 * to preserve their order in a separate data structure to treat
+	 * src and dst host addresses as independently selectable.
+	 */
+	host_keys = keys;
+	host_keys.ports.ports     = 0;
+	host_keys.basic.ip_proto  = 0;
+	host_keys.keyid.keyid     = 0;
+	host_keys.tags.flow_label = 0;
+
+	switch (host_keys.control.addr_type) {
+	case FLOW_DISSECTOR_KEY_IPV4_ADDRS:
+		host_keys.addrs.v4addrs.src = 0;
+		dsthost_hash = flow_hash_from_keys(&host_keys);
+		host_keys.addrs.v4addrs.src = keys.addrs.v4addrs.src;
+		host_keys.addrs.v4addrs.dst = 0;
+		srchost_hash = flow_hash_from_keys(&host_keys);
+		break;
+
+	case FLOW_DISSECTOR_KEY_IPV6_ADDRS:
+		memset(&host_keys.addrs.v6addrs.src, 0,
+		       sizeof(host_keys.addrs.v6addrs.src));
+		dsthost_hash = flow_hash_from_keys(&host_keys);
+		host_keys.addrs.v6addrs.src = keys.addrs.v6addrs.src;
+		memset(&host_keys.addrs.v6addrs.dst, 0,
+		       sizeof(host_keys.addrs.v6addrs.dst));
+		srchost_hash = flow_hash_from_keys(&host_keys);
+		break;
+
+	default:
+		dsthost_hash = 0;
+		srchost_hash = 0;
+	};
+
+	/* This *must* be after the above switch, since as a
+	 * side-effect it sorts the src and dst addresses.
+	 */
+	if (flow_mode & CAKE_FLOW_FLOWS)
+		flow_hash = flow_hash_from_keys(&keys);
+
+	if (!(flow_mode & CAKE_FLOW_FLOWS)) {
+		if (flow_mode & CAKE_FLOW_SRC_IP)
+			flow_hash ^= srchost_hash;
+
+		if (flow_mode & CAKE_FLOW_DST_IP)
+			flow_hash ^= dsthost_hash;
+	}
+
+	reduced_hash = flow_hash % CAKE_QUEUES;
+
+	/* set-associative hashing */
+	/* fast path if no hash collision (direct lookup succeeds) */
+	if (likely(q->tags[reduced_hash] == flow_hash &&
+		   q->flows[reduced_hash].set)) {
+		q->way_directs++;
+	} else {
+		u32 inner_hash = reduced_hash % CAKE_SET_WAYS;
+		u32 outer_hash = reduced_hash - inner_hash;
+		u32 i, k;
+		bool allocate_src = false;
+		bool allocate_dst = false;
+
+		/* check if any active queue in the set is reserved for
+		 * this flow.
+		 */
+		for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
+		     i++, k = (k + 1) % CAKE_SET_WAYS) {
+			if (q->tags[outer_hash + k] == flow_hash) {
+				if (i)
+					q->way_hits++;
+
+				if (!q->flows[outer_hash + k].set) {
+					/* need to increment host refcnts */
+					allocate_src = cake_dsrc(flow_mode);
+					allocate_dst = cake_ddst(flow_mode);
+				}
+
+				goto found;
+			}
+		}
+
+		/* no queue is reserved for this flow, look for an
+		 * empty one.
+		 */
+		for (i = 0; i < CAKE_SET_WAYS;
+			 i++, k = (k + 1) % CAKE_SET_WAYS) {
+			if (!q->flows[outer_hash + k].set) {
+				q->way_misses++;
+				allocate_src = cake_dsrc(flow_mode);
+				allocate_dst = cake_ddst(flow_mode);
+				goto found;
+			}
+		}
+
+		/* With no empty queues, default to the original
+		 * queue, accept the collision, update the host tags.
+		 */
+		q->way_collisions++;
+		q->hosts[q->flows[reduced_hash].srchost].srchost_refcnt--;
+		q->hosts[q->flows[reduced_hash].dsthost].dsthost_refcnt--;
+		allocate_src = cake_dsrc(flow_mode);
+		allocate_dst = cake_ddst(flow_mode);
+found:
+		/* reserve queue for future packets in same flow */
+		reduced_hash = outer_hash + k;
+		q->tags[reduced_hash] = flow_hash;
+
+		if (allocate_src) {
+			srchost_idx = srchost_hash % CAKE_QUEUES;
+			inner_hash = srchost_idx % CAKE_SET_WAYS;
+			outer_hash = srchost_idx - inner_hash;
+			for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
+				i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (q->hosts[outer_hash + k].srchost_tag ==
+				    srchost_hash)
+					goto found_src;
+			}
+			for (i = 0; i < CAKE_SET_WAYS;
+				i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (!q->hosts[outer_hash + k].srchost_refcnt)
+					break;
+			}
+			q->hosts[outer_hash + k].srchost_tag = srchost_hash;
+found_src:
+			srchost_idx = outer_hash + k;
+			q->hosts[srchost_idx].srchost_refcnt++;
+			q->flows[reduced_hash].srchost = srchost_idx;
+		}
+
+		if (allocate_dst) {
+			dsthost_idx = dsthost_hash % CAKE_QUEUES;
+			inner_hash = dsthost_idx % CAKE_SET_WAYS;
+			outer_hash = dsthost_idx - inner_hash;
+			for (i = 0, k = inner_hash; i < CAKE_SET_WAYS;
+			     i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (q->hosts[outer_hash + k].dsthost_tag ==
+				    dsthost_hash)
+					goto found_dst;
+			}
+			for (i = 0; i < CAKE_SET_WAYS;
+			     i++, k = (k + 1) % CAKE_SET_WAYS) {
+				if (!q->hosts[outer_hash + k].dsthost_refcnt)
+					break;
+			}
+			q->hosts[outer_hash + k].dsthost_tag = dsthost_hash;
+found_dst:
+			dsthost_idx = outer_hash + k;
+			q->hosts[dsthost_idx].dsthost_refcnt++;
+			q->flows[reduced_hash].dsthost = dsthost_idx;
+		}
+	}
+
+	return reduced_hash;
+}
+
+/* helper functions : might be changed when/if skb use a standard list_head */
+/* remove one skb from head of slot queue */
+
+static inline struct sk_buff *dequeue_head(struct cake_flow *flow)
+{
+	struct sk_buff *skb = flow->head;
+
+	if (skb) {
+		flow->head = skb->next;
+		skb->next = NULL;
+
+		if (skb == flow->ackcheck)
+			flow->ackcheck = NULL;
+	}
+
+	return skb;
+}
+
+/* add skb to flow queue (tail add) */
+
+static inline void
+flow_queue_add(struct cake_flow *flow, struct sk_buff *skb)
+{
+	if (!flow->head)
+		flow->head = skb;
+	else
+		flow->tail->next = skb;
+	flow->tail = skb;
+	skb->next = NULL;
+}
+
+static struct sk_buff *cake_ack_filter(struct cake_sched_data *q,
+				       struct cake_flow *flow)
+{
+	int seglen;
+	struct sk_buff *skb = flow->tail, *skb_check, *skb_check_prev;
+	struct iphdr *iph, *iph_check;
+	struct ipv6hdr *ipv6h, *ipv6h_check;
+	struct tcphdr *tcph, *tcph_check;
+	bool otherconn_ack_seen = false;
+	struct sk_buff *otherconn_checked_to = NULL;
+	bool thisconn_redundant_seen = false, thisconn_seen_last = false;
+	struct sk_buff *thisconn_checked_to = NULL, *thisconn_ack = NULL;
+	bool aggressive = q->ack_filter == CAKE_ACK_AGGRESSIVE;
+
+	/* no other possible ACKs to filter */
+	if (flow->head == skb)
+		return NULL;
+
+	iph = skb->encapsulation ? inner_ip_hdr(skb) : ip_hdr(skb);
+	ipv6h = skb->encapsulation ? inner_ipv6_hdr(skb) : ipv6_hdr(skb);
+
+	/* check that the innermost network header is v4/v6, and contains TCP */
+	if (pskb_may_pull(skb, ((unsigned char *)iph - skb->head) + sizeof(struct iphdr)) &&
+	    iph->version == 4) {
+		if (iph->protocol != IPPROTO_TCP)
+			return NULL;
+		seglen = ntohs(iph->tot_len) - (4 * iph->ihl);
+		tcph = (struct tcphdr *)((void *)iph + (4 * iph->ihl));
+		if (!pskb_may_pull(skb, ((unsigned char *)tcph - skb->head) + sizeof(struct tcphdr)))
+			return NULL;
+	} else if (pskb_may_pull(skb, ((unsigned char *)ipv6h - skb->head) + sizeof(struct ipv6hdr) + sizeof(struct tcphdr)) &&
+	           ipv6h->version == 6) {
+		if (ipv6h->nexthdr != IPPROTO_TCP)
+			return NULL;
+		seglen = ntohs(ipv6h->payload_len);
+		tcph = (struct tcphdr *)((void *)ipv6h +
+					 sizeof(struct ipv6hdr));
+	} else {
+		return NULL;
+	}
+
+	/* the 'triggering' packet need only have the ACK flag set.
+	 * also check that SYN is not set, as there won't be any previous ACKs.
+	 */
+	if ((tcp_flag_word(tcph) &
+		(TCP_FLAG_ACK | TCP_FLAG_SYN)) != TCP_FLAG_ACK)
+		return NULL;
+
+	/* the 'triggering' ACK is at the end of the queue,
+	 * we have already returned if it is the only packet in the flow.
+	 * stop before last packet in queue, don't compare trigger ACK to itself
+	 * start where we finished last time if recorded in ->ackcheck
+	 * otherwise start from the the head of the flow queue.
+	 */
+	skb_check_prev = flow->ackcheck;
+	skb_check = flow->ackcheck ?: flow->head;
+
+	while (skb_check->next) {
+		bool pure_ack, thisconn;
+
+		/* don't increment if at head of flow queue (_prev == NULL) */
+		if (skb_check_prev) {
+			skb_check_prev = skb_check;
+			skb_check = skb_check->next;
+			if (!skb_check->next)
+				break;
+		} else {
+			skb_check_prev = ERR_PTR(-1);
+		}
+
+		iph_check = skb_check->encapsulation ?
+			inner_ip_hdr(skb_check) : ip_hdr(skb_check);
+		ipv6h_check = skb_check->encapsulation ?
+			inner_ipv6_hdr(skb_check) : ipv6_hdr(skb_check);
+
+		if (pskb_may_pull(skb_check, ((unsigned char *)iph_check - skb_check->head) + sizeof(struct iphdr)) &&
+		    iph_check->version == 4) {
+			if (iph_check->protocol != IPPROTO_TCP)
+				continue;
+			seglen = (ntohs(iph_check->tot_len) -
+				  (4 * iph_check->ihl));
+			tcph_check = (struct tcphdr *)((void *)iph_check
+				+ (4 * iph_check->ihl));
+			if (iph->version == 4 &&
+			    iph_check->saddr == iph->saddr &&
+			    iph_check->daddr == iph->daddr) {
+				thisconn = true;
+			} else {
+				thisconn = false;
+			}
+		} else if (pskb_may_pull(skb_check, ((unsigned char *)ipv6h_check - skb_check->head) + sizeof(struct ipv6hdr)) &&
+		           ipv6h_check->version == 6) {
+			if (ipv6h_check->nexthdr != IPPROTO_TCP)
+				continue;
+			seglen = ntohs(ipv6h_check->payload_len);
+			tcph_check = (struct tcphdr *)((void *)ipv6h_check +
+				     sizeof(struct ipv6hdr));
+			if (ipv6h->version == 6 &&
+			    ipv6_addr_cmp(&ipv6h_check->saddr, &ipv6h->saddr) &&
+				ipv6_addr_cmp(&ipv6h_check->daddr,
+					      &ipv6h->daddr)) {
+				thisconn = true;
+			} else {
+				thisconn = false;
+			}
+		} else {
+			continue;
+		}
+
+		if (!pskb_may_pull(skb_check, ((unsigned char *)tcph_check - skb_check->head) + sizeof(struct tcphdr)))
+			continue;
+
+		/* stricter criteria apply to ACKs that we may filter
+		 * 3 reserved flags must be unset to avoid future breakage
+		 * ECE/CWR/NS can be safely ignored
+		 * ACK must be set
+		 * All other flags URG/PSH/RST/SYN/FIN must be unset
+		 * 0x0FFF0000 = all TCP flags (confirm ACK=1, others zero)
+		 * 0x01C00000 = NS/CWR/ECE (safe to ignore)
+		 * 0x0E3F0000 = 0x0FFF0000 & ~0x01C00000
+		 * must be 'pure' ACK, contain zero bytes of segment data
+		 * options are ignored
+		 */
+		if ((tcp_flag_word(tcph_check) &
+			(TCP_FLAG_ACK | TCP_FLAG_SYN)) != TCP_FLAG_ACK) {
+			continue;
+		} else if (((tcp_flag_word(tcph_check) &
+				cpu_to_be32(0x0E3F0000)) != TCP_FLAG_ACK) ||
+			   ((seglen - 4 * tcph_check->doff) != 0)) {
+			pure_ack = false;
+		} else {
+			pure_ack = true;
+		}
+
+		/* if we find an ACK belonging to a different connection
+		 * continue checking for other ACKs this round however
+		 * restart checking from the other connection next time.
+		 */
+		if (thisconn &&	(tcph_check->source != tcph->source ||
+				 tcph_check->dest != tcph->dest)) {
+			thisconn = false;
+		}
+
+		/* new ack sequence must be greater
+		 */
+		if (thisconn &&
+		    ((int32_t)(ntohl(tcph_check->ack_seq) - ntohl(tcph->ack_seq)) > 0))
+			continue;
+
+		/* DupACKs with an equal sequence number shouldn't be filtered,
+		 * but we can filter if the triggering packet is a SACK
+		 */
+		if (thisconn &&
+		    (ntohl(tcph_check->ack_seq) == ntohl(tcph->ack_seq)) &&
+		    pskb_may_pull(skb, ((unsigned char *)tcph - skb->head) + (tcph->doff * 4))) {
+			/* inspired by tcp_parse_options in tcp_input.c */
+			bool sack = false;
+			int length = (tcph->doff * 4) - sizeof(struct tcphdr);
+			const u8 *ptr = (const u8 *)(tcph + 1);
+
+			while (length > 0) {
+				int opcode = *ptr++;
+				int opsize;
+
+				if (opcode == TCPOPT_EOL)
+					break;
+				if (opcode == TCPOPT_NOP) {
+					length--;
+					continue;
+				}
+				opsize = *ptr++;
+				if (opsize < 2 || opsize > length)
+					break;
+				if (opcode == TCPOPT_SACK) {
+					sack = true;
+					break;
+				}
+				ptr += opsize - 2;
+				length -= opsize;
+			}
+			if (!sack)
+				continue;
+		}
+
+		/* somewhat complicated control flow for 'conservative'
+		 * ACK filtering that aims to be more polite to slow-start and
+		 * in the presence of packet loss.
+		 * does not filter if there is one 'redundant' ACK in the queue.
+		 * 'data' ACKs won't be filtered but do count as redundant ACKs.
+		 */
+		if (thisconn) {
+			thisconn_seen_last = true;
+			/* if aggressive and this is a data ack we can skip
+			 * checking it next time.
+			 */
+			thisconn_checked_to = (aggressive && !pure_ack) ?
+				skb_check : skb_check_prev;
+			/* the first pure ack for this connection.
+			 * record where it is, but only break if aggressive
+			 * or already seen data ack from the same connection
+			 */
+			if (pure_ack && !thisconn_ack) {
+				thisconn_ack = skb_check_prev;
+				if (aggressive || thisconn_redundant_seen)
+					break;
+			/* data ack or subsequent pure ack */
+			} else {
+				thisconn_redundant_seen = true;
+				/* this is the second ack for this connection
+				 * break to filter the first pure ack
+				 */
+				if (thisconn_ack)
+					break;
+			}
+		/* track packets from non-matching tcp connections that will
+		 * need evaluation on the next run.
+		 * if there are packets from both the matching connection and
+		 * others that requre checking next run, track which was updated
+		 * last and return the older of the two to ensure full coverage.
+		 * if a non-matching pure ack has been seen, cannot skip any
+		 * further on the next run so don't update.
+		 */
+		} else if (!otherconn_ack_seen) {
+			thisconn_seen_last = false;
+			if (pure_ack) {
+				otherconn_ack_seen = true;
+				/* if aggressive we don't care about old data,
+				 * start from the pure ack.
+				 * otherwise if there is a previous data ack,
+				 * start checking from it next time.
+				 */
+				if (aggressive || !otherconn_checked_to)
+					otherconn_checked_to = skb_check_prev;
+			} else {
+				otherconn_checked_to = aggressive ?
+					skb_check : skb_check_prev;
+			}
+		}
+	}
+
+	/* skb_check is reused at this point
+	 * it is the pure ACK to be filtered (if any)
+	 */
+	skb_check = NULL;
+
+	/* next time start checking from the older/nearest to head of unfiltered
+	 * but important tcp packets from this connection and other connections.
+	 * if none seen, start after the last packet evaluated in the loop.
+	 */
+	if (thisconn_checked_to && otherconn_checked_to)
+		flow->ackcheck = thisconn_seen_last ?
+			otherconn_checked_to : thisconn_checked_to;
+	else if (thisconn_checked_to)
+		flow->ackcheck = thisconn_checked_to;
+	else if (otherconn_checked_to)
+		flow->ackcheck = otherconn_checked_to;
+	else
+		flow->ackcheck = skb_check_prev;
+
+	/* if filtering, the pure ACK from the flow queue */
+	if (thisconn_ack && (aggressive || thisconn_redundant_seen)) {
+		if (PTR_ERR(thisconn_ack) == -1) {
+			skb_check = flow->head;
+			flow->head = flow->head->next;
+		} else {
+			skb_check = thisconn_ack->next;
+			thisconn_ack->next = thisconn_ack->next->next;
+		}
+	}
+
+	/* we just filtered that ack, fix up the list */
+	if (flow->ackcheck == skb_check)
+		flow->ackcheck = thisconn_ack;
+	/* check the entire flow queue next time */
+	if (PTR_ERR(flow->ackcheck) == -1)
+		flow->ackcheck = NULL;
+
+	return skb_check;
+}
+
+static inline cobalt_time_t cake_ewma(cobalt_time_t avg, cobalt_time_t sample,
+				      u32 shift)
+{
+	avg -= avg >> shift;
+	avg += sample >> shift;
+	return avg;
+}
+
+static inline u32 cake_overhead(struct cake_sched_data *q, struct sk_buff *skb)
+{
+	const struct skb_shared_info *shinfo = skb_shinfo(skb);
+	u32 off = skb_network_offset(skb);
+	u32 len = qdisc_pkt_len(skb);
+	u16 segs = 1;
+
+	if (unlikely(shinfo->gso_size)) {
+		/* borrowed from qdisc_pkt_len_init() */
+		unsigned int hdr_len;
+
+		hdr_len = skb_transport_header(skb) - skb_mac_header(skb);
+
+                /* + transport layer */
+                if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) {
+                        const struct tcphdr *th;
+                        struct tcphdr _tcphdr;
+
+                        th = skb_header_pointer(skb, skb_transport_offset(skb),
+                                                sizeof(_tcphdr), &_tcphdr);
+                        if (likely(th))
+                                hdr_len += __tcp_hdrlen(th);
+                } else {
+                        struct udphdr _udphdr;
+
+                        if (skb_header_pointer(skb, skb_transport_offset(skb),
+                                               sizeof(_udphdr), &_udphdr))
+                                hdr_len += sizeof(struct udphdr);
+                }
+
+		if (unlikely(shinfo->gso_type & SKB_GSO_DODGY))
+			segs = DIV_ROUND_UP(skb->len - hdr_len,
+                                                shinfo->gso_size);
+		else
+			segs = shinfo->gso_segs;
+
+		/* The last segment may be shorter; we ignore this, which means
+		 * that we will over-estimate the size of the whole GSO segment
+		 * by the difference in size. This is conservative, so we live
+		 * with that to avoid the complexity of dealing with it.
+		 */
+		len = shinfo->gso_size + hdr_len;
+	}
+
+	q->avg_netoff = cake_ewma(q->avg_netoff, off << 16, 8);
+
+	if (q->rate_flags & CAKE_FLAG_OVERHEAD)
+		len -= off;
+
+	if (q->max_netlen < len)
+		q->max_netlen = len;
+	if (q->min_netlen > len)
+		q->min_netlen = len;
+
+	len += q->rate_overhead;
+
+	if (len < q->rate_mpu)
+		len = q->rate_mpu;
+
+	if (q->atm_mode == CAKE_ATM_ATM) {
+		len += 47;
+		len /= 48;
+		len *= 53;
+	} else if (q->atm_mode == CAKE_ATM_PTM) {
+		/* Add one byte per 64 bytes or part thereof.
+		 * This is conservative and easier to calculate than the
+		 * precise value.
+		 */
+		len += (len + 63) / 64;
+	}
+
+	if (q->max_adjlen < len)
+		q->max_adjlen = len;
+	if (q->min_adjlen > len)
+		q->min_adjlen = len;
+
+	get_cobalt_cb(skb)->adjusted_len = len * segs;
+	return len;
+}
+
+static inline void cake_heap_swap(struct cake_sched_data *q, u16 i, u16 j)
+{
+	struct cake_heap_entry ii = q->overflow_heap[i];
+	struct cake_heap_entry jj = q->overflow_heap[j];
+
+	q->overflow_heap[i] = jj;
+	q->overflow_heap[j] = ii;
+
+	q->tins[ii.t].overflow_idx[ii.b] = j;
+	q->tins[jj.t].overflow_idx[jj.b] = i;
+}
+
+static inline u32 cake_heap_get_backlog(const struct cake_sched_data *q, u16 i)
+{
+	struct cake_heap_entry ii = q->overflow_heap[i];
+
+	return q->tins[ii.t].backlogs[ii.b];
+}
+
+static void cake_heapify(struct cake_sched_data *q, u16 i)
+{
+	static const u32 a = CAKE_MAX_TINS * CAKE_QUEUES;
+	u32 m = i;
+	u32 mb = cake_heap_get_backlog(q, m);
+
+	while (m < a) {
+		u32 l = m + m + 1;
+		u32 r = l + 1;
+
+		if (l < a) {
+			u32 lb = cake_heap_get_backlog(q, l);
+
+			if (lb > mb) {
+				m  = l;
+				mb = lb;
+			}
+		}
+
+		if (r < a) {
+			u32 rb = cake_heap_get_backlog(q, r);
+
+			if (rb > mb) {
+				m  = r;
+				mb = rb;
+			}
+		}
+
+		if (m != i) {
+			cake_heap_swap(q, i, m);
+			i = m;
+		} else {
+			break;
+		}
+	}
+}
+
+static void cake_heapify_up(struct cake_sched_data *q, u16 i)
+{
+	while (i > 0 && i < CAKE_MAX_TINS * CAKE_QUEUES) {
+		u16 p = (i - 1) >> 1;
+		u32 ib = cake_heap_get_backlog(q, i);
+		u32 pb = cake_heap_get_backlog(q, p);
+
+		if (ib > pb) {
+			cake_heap_swap(q, i, p);
+			i = p;
+		} else {
+			break;
+		}
+	}
+}
+
+static int cake_advance_shaper(struct cake_sched_data *q,
+			       struct cake_tin_data *b,
+			       struct sk_buff *skb,
+			       u64 now, bool drop)
+{
+	u32 len = get_cobalt_cb(skb)->adjusted_len;
+
+	/* charge packet bandwidth to this tin
+	 * and to the global shaper.
+	 */
+	if (q->rate_ns) {
+		s64 tdiff1 = b->tin_time_next_packet - now;
+		s64 tdiff2 = (len * (u64)b->tin_rate_ns) >> b->tin_rate_shft;
+		s64 tdiff3 = (len * (u64)q->rate_ns) >> q->rate_shft;
+		s64 tdiff4 = tdiff3 + (tdiff3 >> 1);
+
+		if (tdiff1 < 0)
+			b->tin_time_next_packet += tdiff2;
+		else if (tdiff1 < tdiff2)
+			b->tin_time_next_packet = now + tdiff2;
+
+		q->time_next_packet += tdiff3;
+		if (!drop)
+			q->failsafe_next_packet += tdiff4;
+	}
+	return len;
+}
+
+static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	u32 idx = 0, tin = 0, len;
+	struct cake_tin_data *b;
+	struct cake_flow *flow;
+	struct cake_heap_entry qq;
+	u64 now = cobalt_get_time();
+
+	if (!q->overflow_timeout) {
+		int i;
+		/* Build fresh max-heap */
+		for (i = CAKE_MAX_TINS * CAKE_QUEUES / 2; i >= 0; i--)
+			cake_heapify(q, i);
+	}
+	q->overflow_timeout = 65535;
+
+	/* select longest queue for pruning */
+	qq  = q->overflow_heap[0];
+	tin = qq.t;
+	idx = qq.b;
+
+	b = &q->tins[tin];
+	flow = &b->flows[idx];
+	skb = dequeue_head(flow);
+	if (unlikely(!skb)) {
+		/* heap has gone wrong, rebuild it next time */
+		q->overflow_timeout = 0;
+		return idx + (tin << 16);
+	}
+
+	if (cobalt_queue_full(&flow->cvars, &b->cparams, now))
+		b->unresponsive_flow_count++;
+
+	len = qdisc_pkt_len(skb);
+	q->buffer_used      -= skb->truesize;
+	b->backlogs[idx]    -= len;
+	b->tin_backlog      -= len;
+	sch->qstats.backlog -= len;
+	qdisc_tree_reduce_backlog(sch, 1, len);
+
+	b->tin_dropped++;
+	sch->qstats.drops++;
+
+	if (q->rate_flags & CAKE_FLAG_INGRESS)
+		cake_advance_shaper(q, b, skb, now, true);
+
+	__qdisc_drop(skb, to_free);
+	sch->q.qlen--;
+
+	cake_heapify(q, 0);
+
+	return idx + (tin << 16);
+}
+
+static inline void cake_wash_diffserv(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+		break;
+	case htons(ETH_P_IPV6):
+		ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+		break;
+	default:
+		break;
+	};
+}
+
+static inline u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash)
+{
+	u8 dscp;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		dscp = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+		if (wash && dscp)
+			ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+		return dscp;
+
+	case htons(ETH_P_IPV6):
+		dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+		if (wash && dscp)
+			ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+		return dscp;
+
+	case htons(ETH_P_ARP):
+		return 0x38;  /* CS7 - Net Control */
+
+	default:
+		/* If there is no Diffserv field, treat as best-effort */
+		return 0;
+	};
+}
+
+static void cake_reconfigure(struct Qdisc *sch);
+
+static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
+			struct sk_buff **to_free)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 idx, tin;
+	struct cake_tin_data *b;
+	struct cake_flow *flow;
+	/* signed len to handle corner case filtered ACK larger than trigger */
+	int len = qdisc_pkt_len(skb);
+	u64 now = cobalt_get_time();
+	struct sk_buff *ack = NULL;
+
+	/* extract the Diffserv Precedence field, if it exists */
+	/* and clear DSCP bits if washing */
+	if (q->tin_mode != CAKE_DIFFSERV_BESTEFFORT) {
+		tin = q->tin_index[cake_handle_diffserv(skb,
+				q->rate_flags & CAKE_FLAG_WASH)];
+		if (unlikely(tin >= q->tin_cnt))
+			tin = 0;
+	} else {
+		tin = 0;
+		if (q->rate_flags & CAKE_FLAG_WASH)
+			cake_wash_diffserv(skb);
+	}
+
+	b = &q->tins[tin];
+
+	/* choose flow to insert into */
+	idx = cake_hash(b, skb, q->flow_mode);
+	flow = &b->flows[idx];
+
+	/* ensure shaper state isn't stale */
+	if (!b->tin_backlog) {
+		if (b->tin_time_next_packet < now)
+			b->tin_time_next_packet = now;
+
+		if (!sch->q.qlen) {
+			if (q->time_next_packet < now) {
+				q->failsafe_next_packet = now;
+				q->time_next_packet = now;
+			} else if (q->time_next_packet > now &&
+				   q->failsafe_next_packet > now) {
+				u64 next = min(q->time_next_packet,
+					       q->failsafe_next_packet);
+				sch->qstats.overlimits++;
+				qdisc_watchdog_schedule_ns(&q->watchdog, next);
+			}
+		}
+	}
+
+	if (unlikely(len > b->max_skblen))
+		b->max_skblen = len;
+
+	/* Split GSO aggregates if they're likely to impair flow isolation */
+
+	if (skb_is_gso(skb) && q->rate_flags & CAKE_FLAG_SPLIT_GSO) {
+		struct sk_buff *segs, *nskb;
+		netdev_features_t features = netif_skb_features(skb);
+		unsigned int slen = 0;
+
+		segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
+		if (IS_ERR_OR_NULL(segs))
+			return qdisc_drop(skb, sch, to_free);
+
+		while (segs) {
+			nskb = segs->next;
+			segs->next = NULL;
+			qdisc_skb_cb(segs)->pkt_len = segs->len;
+			cobalt_set_enqueue_time(segs, now);
+			get_cobalt_cb(segs)->adjusted_len = cake_overhead(q,
+									  segs);
+			flow_queue_add(flow, segs);
+
+			sch->q.qlen++;
+			slen += segs->len;
+			q->buffer_used += segs->truesize;
+			b->packets++;
+			segs = nskb;
+		}
+		/* stats */
+		b->bytes	    += slen;
+		b->backlogs[idx]    += slen;
+		b->tin_backlog      += slen;
+		sch->qstats.backlog += slen;
+		q->avg_window_bytes += slen;
+
+		qdisc_tree_reduce_backlog(sch, 1, len);
+		consume_skb(skb);
+	} else {
+		/* not splitting */
+		cobalt_set_enqueue_time(skb, now);
+		get_cobalt_cb(skb)->adjusted_len = cake_overhead(q, skb);
+		flow_queue_add(flow, skb);
+
+		if (q->ack_filter)
+			ack = cake_ack_filter(q, flow);
+
+		if (ack) {
+			b->ack_drops++;
+			sch->qstats.drops++;
+			b->bytes += qdisc_pkt_len(ack);
+			len -= qdisc_pkt_len(ack);
+			q->buffer_used += skb->truesize - ack->truesize;
+			if (q->rate_flags & CAKE_FLAG_INGRESS)
+				cake_advance_shaper(q, b, ack, now, true);
+
+			qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack));
+			consume_skb(ack);
+		} else {
+			sch->q.qlen++;
+			q->buffer_used      += skb->truesize;
+		}
+		/* stats */
+		b->packets++;
+		b->bytes	    += len;
+		b->backlogs[idx]    += len;
+		b->tin_backlog      += len;
+		sch->qstats.backlog += len;
+		q->avg_window_bytes += len;
+	}
+
+	if (q->overflow_timeout)
+		cake_heapify_up(q, b->overflow_idx[idx]);
+
+	/* incoming bandwidth capacity estimate */
+	if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS) {
+		u64 packet_interval = now - q->last_packet_time;
+
+		if (packet_interval > NSEC_PER_SEC)
+			packet_interval = NSEC_PER_SEC;
+
+		/* filter out short-term bursts, eg. wifi aggregation */
+		q->avg_packet_interval = cake_ewma(q->avg_packet_interval,
+						   packet_interval,
+			packet_interval > q->avg_packet_interval ? 2 : 8);
+
+		q->last_packet_time = now;
+
+		if (packet_interval > q->avg_packet_interval) {
+			u64 window_interval = now - q->avg_window_begin;
+			u64 b = q->avg_window_bytes * (u64)NSEC_PER_SEC;
+
+			do_div(b, window_interval);
+			q->avg_peak_bandwidth =
+				cake_ewma(q->avg_peak_bandwidth, b,
+					  b > q->avg_peak_bandwidth ? 2 : 8);
+			q->avg_window_bytes = 0;
+			q->avg_window_begin = now;
+
+			if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS &&
+			    now - q->last_reconfig_time >
+				(NSEC_PER_SEC / 4)) {
+				q->rate_bps = (q->avg_peak_bandwidth * 15) >> 4;
+				cake_reconfigure(sch);
+			}
+		}
+	} else {
+		q->avg_window_bytes = 0;
+		q->last_packet_time = now;
+	}
+
+	/* flowchain */
+	if (!flow->set || flow->set == CAKE_SET_DECAYING) {
+		struct cake_host *srchost = &b->hosts[flow->srchost];
+		struct cake_host *dsthost = &b->hosts[flow->dsthost];
+		u16 host_load = 1;
+
+		if (!flow->set) {
+			list_add_tail(&flow->flowchain, &b->new_flows);
+		} else {
+			b->decaying_flow_count--;
+			list_move_tail(&flow->flowchain, &b->new_flows);
+		}
+		flow->set = CAKE_SET_SPARSE;
+		b->sparse_flow_count++;
+
+		if (cake_dsrc(q->flow_mode))
+			host_load = max(host_load, srchost->srchost_refcnt);
+
+		if (cake_ddst(q->flow_mode))
+			host_load = max(host_load, dsthost->dsthost_refcnt);
+
+		flow->deficit = (b->flow_quantum *
+				 quantum_div[host_load]) >> 16;
+	} else if (flow->set == CAKE_SET_SPARSE_WAIT) {
+		/* this flow was empty, accounted as a sparse flow, but actually
+		 * in the bulk rotation.
+		 */
+		flow->set = CAKE_SET_BULK;
+		b->sparse_flow_count--;
+		b->bulk_flow_count++;
+	}
+
+	if (q->buffer_used > q->buffer_max_used)
+		q->buffer_max_used = q->buffer_used;
+
+	if (q->buffer_used > q->buffer_limit) {
+		u32 dropped = 0;
+
+		while (q->buffer_used > q->buffer_limit) {
+			dropped++;
+			cake_drop(sch, to_free);
+		}
+		b->drop_overlimit += dropped;
+	}
+	return NET_XMIT_SUCCESS;
+}
+
+static struct sk_buff *cake_dequeue_one(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct cake_tin_data *b = &q->tins[q->cur_tin];
+	struct cake_flow *flow = &b->flows[q->cur_flow];
+	struct sk_buff *skb = NULL;
+	u32 len;
+
+	if (flow->head) {
+		skb = dequeue_head(flow);
+		len = qdisc_pkt_len(skb);
+		b->backlogs[q->cur_flow] -= len;
+		b->tin_backlog		 -= len;
+		sch->qstats.backlog      -= len;
+		q->buffer_used		 -= skb->truesize;
+		sch->q.qlen--;
+
+		if (q->overflow_timeout)
+			cake_heapify(q, b->overflow_idx[q->cur_flow]);
+	}
+	return skb;
+}
+
+/* Discard leftover packets from a tin no longer in use. */
+static void cake_clear_tin(struct Qdisc *sch, u16 tin)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+
+	q->cur_tin = tin;
+	for (q->cur_flow = 0; q->cur_flow < CAKE_QUEUES; q->cur_flow++)
+		while (!!(skb = cake_dequeue_one(sch)))
+			kfree_skb(skb);
+}
+
+static struct sk_buff *cake_dequeue(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct sk_buff *skb;
+	struct cake_tin_data *b = &q->tins[q->cur_tin];
+	struct cake_flow *flow;
+	struct cake_host *srchost, *dsthost;
+	struct list_head *head;
+	u32 len;
+	u16 host_load;
+	cobalt_time_t now = ktime_get_ns();
+	cobalt_time_t delay;
+	bool first_flow = true;
+
+begin:
+	if (!sch->q.qlen)
+		return NULL;
+
+	/* global hard shaper */
+	if (q->time_next_packet > now && q->failsafe_next_packet > now) {
+		u64 next = min(q->time_next_packet, q->failsafe_next_packet);
+
+		sch->qstats.overlimits++;
+		qdisc_watchdog_schedule_ns(&q->watchdog, next);
+		return NULL;
+	}
+
+	/* Choose a class to work on. */
+	if (!q->rate_ns) {
+		/* In unlimited mode, can't rely on shaper timings, just balance
+		 * with DRR
+		 */
+		while (b->tin_deficit < 0 ||
+		       !(b->sparse_flow_count + b->bulk_flow_count)) {
+			if (b->tin_deficit <= 0)
+				b->tin_deficit += b->tin_quantum_band;
+
+			q->cur_tin++;
+			b++;
+			if (q->cur_tin >= q->tin_cnt) {
+				q->cur_tin = 0;
+				b = q->tins;
+			}
+		}
+	} else {
+		/* In shaped mode, choose:
+		 * - Highest-priority tin with queue and meeting schedule, or
+		 * - The earliest-scheduled tin with queue.
+		 */
+		int tin, best_tin = 0;
+		s64 best_time = 0xFFFFFFFFFFFFUL;
+
+		for (tin = 0; tin < q->tin_cnt; tin++) {
+			b = q->tins + tin;
+			if ((b->sparse_flow_count + b->bulk_flow_count) > 0) {
+				s64 tdiff = b->tin_time_next_packet - now;
+
+				if (tdiff <= 0 || tdiff <= best_time) {
+					best_time = tdiff;
+					best_tin = tin;
+				}
+			}
+		}
+
+		q->cur_tin = best_tin;
+		b = q->tins + best_tin;
+	}
+
+retry:
+	/* service this class */
+	head = &b->decaying_flows;
+	if (!first_flow || list_empty(head)) {
+		head = &b->new_flows;
+		if (list_empty(head)) {
+			head = &b->old_flows;
+			if (unlikely(list_empty(head))) {
+				head = &b->decaying_flows;
+				if (unlikely(list_empty(head)))
+					goto begin;
+			}
+		}
+	}
+	flow = list_first_entry(head, struct cake_flow, flowchain);
+	q->cur_flow = flow - b->flows;
+	first_flow = false;
+
+	/* triple isolation (modified DRR++) */
+	srchost = &b->hosts[flow->srchost];
+	dsthost = &b->hosts[flow->dsthost];
+	host_load = 1;
+
+	if (cake_dsrc(q->flow_mode))
+		host_load = max(host_load, srchost->srchost_refcnt);
+
+	if (cake_ddst(q->flow_mode))
+		host_load = max(host_load, dsthost->dsthost_refcnt);
+
+	WARN_ON(host_load > CAKE_QUEUES);
+
+	/* flow isolation (DRR++) */
+	if (flow->deficit <= 0) {
+		/* The shifted prandom_u32() is a way to apply dithering to
+		 * avoid accumulating roundoff errors
+		 */
+		flow->deficit += (b->flow_quantum * quantum_div[host_load] +
+				  (prandom_u32() >> 16)) >> 16;
+		list_move_tail(&flow->flowchain, &b->old_flows);
+
+		/* Keep all flows with deficits out of the sparse and decaying
+		 * rotations.  No non-empty flow can go into the decaying
+		 * rotation, so they can't get deficits
+		 */
+		if (flow->set == CAKE_SET_SPARSE) {
+			if (flow->head) {
+				b->sparse_flow_count--;
+				b->bulk_flow_count++;
+				flow->set = CAKE_SET_BULK;
+			} else {
+				/* we've moved it to the bulk rotation for
+				 * correct deficit accounting but we still want
+				 * to count it as a sparse flow, not a bulk one.
+				 */
+				flow->set = CAKE_SET_SPARSE_WAIT;
+			}
+		}
+		goto retry;
+	}
+
+	/* Retrieve a packet via the AQM */
+	while (1) {
+		skb = cake_dequeue_one(sch);
+		if (!skb) {
+			/* this queue was actually empty */
+			if (cobalt_queue_empty(&flow->cvars, &b->cparams, now))
+				b->unresponsive_flow_count--;
+
+			if (flow->cvars.p_drop || flow->cvars.count ||
+			    now < flow->cvars.drop_next) {
+				/* keep in the flowchain until the state has
+				 * decayed to rest
+				 */
+				list_move_tail(&flow->flowchain,
+					       &b->decaying_flows);
+				if (flow->set == CAKE_SET_BULK) {
+					b->bulk_flow_count--;
+					b->decaying_flow_count++;
+				} else if (flow->set == CAKE_SET_SPARSE ||
+					   flow->set == CAKE_SET_SPARSE_WAIT) {
+					b->sparse_flow_count--;
+					b->decaying_flow_count++;
+				}
+				flow->set = CAKE_SET_DECAYING;
+			} else {
+				/* remove empty queue from the flowchain */
+				list_del_init(&flow->flowchain);
+				if (flow->set == CAKE_SET_SPARSE ||
+				    flow->set == CAKE_SET_SPARSE_WAIT)
+					b->sparse_flow_count--;
+				else if (flow->set == CAKE_SET_BULK)
+					b->bulk_flow_count--;
+				else
+					b->decaying_flow_count--;
+
+				flow->set = CAKE_SET_NONE;
+				srchost->srchost_refcnt--;
+				dsthost->dsthost_refcnt--;
+			}
+			goto begin;
+		}
+
+		/* Last packet in queue may be marked, shouldn't be dropped */
+		if (!cobalt_should_drop(&flow->cvars, &b->cparams, now, skb,
+					(b->bulk_flow_count *
+					 !!(q->rate_flags & CAKE_FLAG_INGRESS))) ||
+		    !flow->head)
+			break;
+
+		/* drop this packet, get another one */
+		if (q->rate_flags & CAKE_FLAG_INGRESS) {
+			len = cake_advance_shaper(q, b, skb,
+						  now, true);
+			flow->deficit -= len;
+			b->tin_deficit -= len;
+		}
+		b->tin_dropped++;
+		qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
+		qdisc_qstats_drop(sch);
+		kfree_skb(skb);
+		if (q->rate_flags & CAKE_FLAG_INGRESS)
+			goto retry;
+	}
+
+	b->tin_ecn_mark += !!flow->cvars.ecn_marked;
+	qdisc_bstats_update(sch, skb);
+
+	/* collect delay stats */
+	delay = now - cobalt_get_enqueue_time(skb);
+	b->avge_delay = cake_ewma(b->avge_delay, delay, 8);
+	b->peak_delay = cake_ewma(b->peak_delay, delay,
+				  delay > b->peak_delay ? 2 : 8);
+	b->base_delay = cake_ewma(b->base_delay, delay,
+				  delay < b->base_delay ? 2 : 8);
+
+	len = cake_advance_shaper(q, b, skb, now, false);
+	flow->deficit -= len;
+	b->tin_deficit -= len;
+
+	if (q->time_next_packet > now && sch->q.qlen) {
+		u64 next = min(q->time_next_packet, q->failsafe_next_packet);
+
+		qdisc_watchdog_schedule_ns(&q->watchdog, next);
+	} else if (!sch->q.qlen) {
+		int i;
+
+		for (i = 0; i < q->tin_cnt; i++) {
+			if (q->tins[i].decaying_flow_count) {
+				u64 next = now + q->tins[i].cparams.target;
+
+				qdisc_watchdog_schedule_ns(&q->watchdog, next);
+				break;
+			}
+		}
+	}
+
+	if (q->overflow_timeout)
+		q->overflow_timeout--;
+
+	return skb;
+}
+
+static void cake_reset(struct Qdisc *sch)
+{
+	u32 c;
+
+	for (c = 0; c < CAKE_MAX_TINS; c++)
+		cake_clear_tin(sch, c);
+}
+
+static const struct nla_policy cake_policy[TCA_CAKE_MAX + 1] = {
+	[TCA_CAKE_BASE_RATE]     = { .type = NLA_U32 },
+	[TCA_CAKE_DIFFSERV_MODE] = { .type = NLA_U32 },
+	[TCA_CAKE_ATM]		 = { .type = NLA_U32 },
+	[TCA_CAKE_FLOW_MODE]     = { .type = NLA_U32 },
+	[TCA_CAKE_OVERHEAD]      = { .type = NLA_S32 },
+	[TCA_CAKE_RTT]		 = { .type = NLA_U32 },
+	[TCA_CAKE_TARGET]	 = { .type = NLA_U32 },
+	[TCA_CAKE_AUTORATE]      = { .type = NLA_U32 },
+	[TCA_CAKE_MEMORY]	 = { .type = NLA_U32 },
+	[TCA_CAKE_NAT]		 = { .type = NLA_U32 },
+	[TCA_CAKE_RAW]       = { .type = NLA_U32 },
+	[TCA_CAKE_WASH]		 = { .type = NLA_U32 },
+	[TCA_CAKE_MPU]		 = { .type = NLA_U32 },
+	[TCA_CAKE_INGRESS]	 = { .type = NLA_U32 },
+	[TCA_CAKE_ACK_FILTER]	 = { .type = NLA_U32 },
+};
+
+static void cake_set_rate(struct cake_tin_data *b, u64 rate, u32 mtu,
+			  cobalt_time_t ns_target, cobalt_time_t rtt_est_ns)
+{
+	/* convert byte-rate into time-per-byte
+	 * so it will always unwedge in reasonable time.
+	 */
+	static const u64 MIN_RATE = 64;
+	u64 rate_ns = 0;
+	u8  rate_shft = 0;
+	cobalt_time_t byte_target_ns;
+	u32 byte_target = mtu;
+
+	b->flow_quantum = 1514;
+	if (rate) {
+		b->flow_quantum = max(min(rate >> 12, 1514ULL), 300ULL);
+		rate_shft = 32;
+		rate_ns = ((u64)NSEC_PER_SEC) << rate_shft;
+		do_div(rate_ns, max(MIN_RATE, rate));
+		while (!!(rate_ns >> 32)) {
+			rate_ns >>= 1;
+			rate_shft--;
+		}
+	} /* else unlimited, ie. zero delay */
+
+	b->tin_rate_bps  = rate;
+	b->tin_rate_ns   = rate_ns;
+	b->tin_rate_shft = rate_shft;
+
+	byte_target_ns = (byte_target * rate_ns) >> rate_shft;
+
+	b->cparams.target = max((byte_target_ns * 3) / 2, ns_target);
+	b->cparams.interval = max(rtt_est_ns +
+				     b->cparams.target - ns_target,
+				     b->cparams.target * 2);
+	b->cparams.mtu_time = byte_target_ns;
+	b->cparams.p_inc = 1 << 24; /* 1/256 */
+	b->cparams.p_dec = 1 << 20; /* 1/4096 */
+}
+
+static int cake_config_besteffort(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct cake_tin_data *b = &q->tins[0];
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+
+	q->tin_cnt = 1;
+
+	q->tin_index = besteffort;
+	q->tin_order = normal_order;
+
+	cake_set_rate(b, rate, mtu, US2TIME(q->target), US2TIME(q->interval));
+	b->tin_quantum_band = 65535;
+	b->tin_quantum_prio = 65535;
+
+	return 0;
+}
+
+static int cake_config_precedence(struct Qdisc *sch)
+{
+	/* convert high-level (user visible) parameters into internal format */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum1 = 256;
+	u32 quantum2 = 256;
+	u32 i;
+
+	q->tin_cnt = 8;
+	q->tin_index = precedence;
+	q->tin_order = normal_order;
+
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[i];
+
+		cake_set_rate(b, rate, mtu, US2TIME(q->target),
+			      US2TIME(q->interval));
+
+		b->tin_quantum_prio = max_t(u16, 1U, quantum1);
+		b->tin_quantum_band = max_t(u16, 1U, quantum2);
+
+		/* calculate next class's parameters */
+		rate  *= 7;
+		rate >>= 3;
+
+		quantum1  *= 3;
+		quantum1 >>= 1;
+
+		quantum2  *= 7;
+		quantum2 >>= 3;
+	}
+
+	return 0;
+}
+
+/*	List of known Diffserv codepoints:
+ *
+ *	Least Effort (CS1)
+ *	Best Effort (CS0)
+ *	Max Reliability & LLT "Lo" (TOS1)
+ *	Max Throughput (TOS2)
+ *	Min Delay (TOS4)
+ *	LLT "La" (TOS5)
+ *	Assured Forwarding 1 (AF1x) - x3
+ *	Assured Forwarding 2 (AF2x) - x3
+ *	Assured Forwarding 3 (AF3x) - x3
+ *	Assured Forwarding 4 (AF4x) - x3
+ *	Precedence Class 2 (CS2)
+ *	Precedence Class 3 (CS3)
+ *	Precedence Class 4 (CS4)
+ *	Precedence Class 5 (CS5)
+ *	Precedence Class 6 (CS6)
+ *	Precedence Class 7 (CS7)
+ *	Voice Admit (VA)
+ *	Expedited Forwarding (EF)
+
+ *	Total 25 codepoints.
+ */
+
+/*	List of traffic classes in RFC 4594:
+ *		(roughly descending order of contended priority)
+ *		(roughly ascending order of uncontended throughput)
+ *
+ *	Network Control (CS6,CS7)      - routing traffic
+ *	Telephony (EF,VA)         - aka. VoIP streams
+ *	Signalling (CS5)               - VoIP setup
+ *	Multimedia Conferencing (AF4x) - aka. video calls
+ *	Realtime Interactive (CS4)     - eg. games
+ *	Multimedia Streaming (AF3x)    - eg. YouTube, NetFlix, Twitch
+ *	Broadcast Video (CS3)
+ *	Low Latency Data (AF2x,TOS4)      - eg. database
+ *	Ops, Admin, Management (CS2,TOS1) - eg. ssh
+ *	Standard Service (CS0 & unrecognised codepoints)
+ *	High Throughput Data (AF1x,TOS2)  - eg. web traffic
+ *	Low Priority Data (CS1)           - eg. BitTorrent
+
+ *	Total 12 traffic classes.
+ */
+
+static int cake_config_diffserv8(struct Qdisc *sch)
+{
+/*	Pruned list of traffic classes for typical applications:
+ *
+ *		Network Control          (CS6, CS7)
+ *		Minimum Latency          (EF, VA, CS5, CS4)
+ *		Interactive Shell        (CS2, TOS1)
+ *		Low Latency Transactions (AF2x, TOS4)
+ *		Video Streaming          (AF4x, AF3x, CS3)
+ *		Bog Standard             (CS0 etc.)
+ *		High Throughput          (AF1x, TOS2)
+ *		Background Traffic       (CS1)
+ *
+ *		Total 8 traffic classes.
+ */
+
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum1 = 256;
+	u32 quantum2 = 256;
+	u32 i;
+
+	q->tin_cnt = 8;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv8;
+	q->tin_order = normal_order;
+
+	/* class characteristics */
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[i];
+
+		cake_set_rate(b, rate, mtu, US2TIME(q->target),
+			      US2TIME(q->interval));
+
+		b->tin_quantum_prio = max_t(u16, 1U, quantum1);
+		b->tin_quantum_band = max_t(u16, 1U, quantum2);
+
+		/* calculate next class's parameters */
+		rate  *= 7;
+		rate >>= 3;
+
+		quantum1  *= 3;
+		quantum1 >>= 1;
+
+		quantum2  *= 7;
+		quantum2 >>= 3;
+	}
+
+	return 0;
+}
+
+static int cake_config_diffserv4(struct Qdisc *sch)
+{
+/*  Further pruned list of traffic classes for four-class system:
+ *
+ *	    Latency Sensitive  (CS7, CS6, EF, VA, CS5, CS4)
+ *	    Streaming Media    (AF4x, AF3x, CS3, AF2x, TOS4, CS2, TOS1)
+ *	    Best Effort        (CS0, AF1x, TOS2, and those not specified)
+ *	    Background Traffic (CS1)
+ *
+ *		Total 4 traffic classes.
+ */
+
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum = 1024;
+
+	q->tin_cnt = 4;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv4;
+	q->tin_order = bulk_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[0], rate, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[1], rate >> 4, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[2], rate >> 1, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[3], rate >> 2, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = quantum;
+	q->tins[1].tin_quantum_prio = quantum >> 4;
+	q->tins[2].tin_quantum_prio = quantum << 2;
+	q->tins[3].tin_quantum_prio = quantum << 4;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = quantum;
+	q->tins[1].tin_quantum_band = quantum >> 4;
+	q->tins[2].tin_quantum_band = quantum >> 1;
+	q->tins[3].tin_quantum_band = quantum >> 2;
+
+	return 0;
+}
+
+static int cake_config_diffserv3(struct Qdisc *sch)
+{
+/*  Simplified Diffserv structure with 3 tins.
+ *		Low Priority		(CS1)
+ *		Best Effort
+ *		Latency Sensitive	(TOS4, VA, EF, CS6, CS7)
+ */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 rate = q->rate_bps;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u32 quantum = 1024;
+
+	q->tin_cnt = 3;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv3;
+	q->tin_order = bulk_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[0], rate, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[1], rate >> 4, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+	cake_set_rate(&q->tins[2], rate >> 2, mtu,
+		      US2TIME(q->target), US2TIME(q->interval));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = quantum;
+	q->tins[1].tin_quantum_prio = quantum >> 4;
+	q->tins[2].tin_quantum_prio = quantum << 4;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = quantum;
+	q->tins[1].tin_quantum_band = quantum >> 4;
+	q->tins[2].tin_quantum_band = quantum >> 2;
+
+	return 0;
+}
+
+static void cake_reconfigure(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	int c, ft;
+
+	switch (q->tin_mode) {
+	case CAKE_DIFFSERV_BESTEFFORT:
+		ft = cake_config_besteffort(sch);
+		break;
+
+	case CAKE_DIFFSERV_PRECEDENCE:
+		ft = cake_config_precedence(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV8:
+		ft = cake_config_diffserv8(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV4:
+		ft = cake_config_diffserv4(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV3:
+	default:
+		ft = cake_config_diffserv3(sch);
+		break;
+	};
+
+	for (c = q->tin_cnt; c < CAKE_MAX_TINS; c++) {
+		cake_clear_tin(sch, c);
+		q->tins[c].cparams.mtu_time = q->tins[ft].cparams.mtu_time;
+	}
+
+	q->rate_ns   = q->tins[ft].tin_rate_ns;
+	q->rate_shft = q->tins[ft].tin_rate_shft;
+
+	if (q->buffer_config_limit) {
+		q->buffer_limit = q->buffer_config_limit;
+	} else if (q->rate_bps) {
+		u64 t = (u64)q->rate_bps * q->interval;
+
+		do_div(t, USEC_PER_SEC / 4);
+		q->buffer_limit = max_t(u32, t, 4U << 20);
+	} else {
+		q->buffer_limit = ~0;
+	}
+
+	sch->flags &= ~TCQ_F_CAN_BYPASS;
+
+	q->buffer_limit = min(q->buffer_limit,
+			      max(sch->limit * psched_mtu(qdisc_dev(sch)),
+				  q->buffer_config_limit));
+}
+
+static int cake_change(struct Qdisc *sch, struct nlattr *opt,
+		       struct netlink_ext_ack *extack)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct nlattr *tb[TCA_CAKE_MAX + 1];
+	int err;
+
+	if (!opt)
+		return -EINVAL;
+
+	err = nla_parse_nested(tb, TCA_CAKE_MAX, opt, cake_policy, extack);
+	if (err < 0)
+		return err;
+
+	if (tb[TCA_CAKE_BASE_RATE])
+		q->rate_bps = nla_get_u32(tb[TCA_CAKE_BASE_RATE]);
+
+	if (tb[TCA_CAKE_DIFFSERV_MODE])
+		q->tin_mode = nla_get_u32(tb[TCA_CAKE_DIFFSERV_MODE]);
+
+	if (tb[TCA_CAKE_ATM])
+		q->atm_mode = nla_get_u32(tb[TCA_CAKE_ATM]);
+
+	if (tb[TCA_CAKE_WASH]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_WASH]))
+			q->rate_flags |= CAKE_FLAG_WASH;
+		else
+			q->rate_flags &= ~CAKE_FLAG_WASH;
+	}
+
+	if (tb[TCA_CAKE_FLOW_MODE])
+		q->flow_mode = nla_get_u32(tb[TCA_CAKE_FLOW_MODE]);
+
+	if (tb[TCA_CAKE_NAT]) {
+		q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
+		q->flow_mode |= CAKE_FLOW_NAT_FLAG *
+			!!nla_get_u32(tb[TCA_CAKE_NAT]);
+	}
+
+	if (tb[TCA_CAKE_OVERHEAD]) {
+		q->rate_overhead = nla_get_s32(tb[TCA_CAKE_OVERHEAD]);
+		q->rate_flags |= CAKE_FLAG_OVERHEAD;
+
+		q->max_netlen = q->max_adjlen = 0;
+		q->min_netlen = q->min_adjlen = ~0;
+	}
+
+	if (tb[TCA_CAKE_RAW]) {
+		q->rate_flags &= ~CAKE_FLAG_OVERHEAD;
+
+		q->max_netlen = q->max_adjlen = 0;
+		q->min_netlen = q->min_adjlen = ~0;
+	}
+
+	if (tb[TCA_CAKE_MPU])
+		q->rate_mpu = nla_get_u32(tb[TCA_CAKE_MPU]);
+
+	if (tb[TCA_CAKE_RTT]) {
+		q->interval = nla_get_u32(tb[TCA_CAKE_RTT]);
+
+		if (!q->interval)
+			q->interval = 1;
+	}
+
+	if (tb[TCA_CAKE_TARGET]) {
+		q->target = nla_get_u32(tb[TCA_CAKE_TARGET]);
+
+		if (!q->target)
+			q->target = 1;
+	}
+
+	if (tb[TCA_CAKE_AUTORATE]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE]))
+			q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS;
+		else
+			q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS;
+	}
+
+	if (tb[TCA_CAKE_INGRESS]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_INGRESS]))
+			q->rate_flags |= CAKE_FLAG_INGRESS;
+		else
+			q->rate_flags &= ~CAKE_FLAG_INGRESS;
+	}
+
+	if (tb[TCA_CAKE_ACK_FILTER])
+		q->ack_filter = nla_get_u32(tb[TCA_CAKE_ACK_FILTER]);
+
+	if (tb[TCA_CAKE_MEMORY])
+		q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]);
+
+	if (q->rate_bps && q->rate_bps <= CAKE_SPLIT_GSO_THRESHOLD)
+		q->rate_flags |= CAKE_FLAG_SPLIT_GSO;
+	else
+		q->rate_flags &= ~CAKE_FLAG_SPLIT_GSO;
+
+	if (q->tins) {
+		sch_tree_lock(sch);
+		cake_reconfigure(sch);
+		sch_tree_unlock(sch);
+	}
+
+	return 0;
+}
+
+
+static void cake_free(void *addr)
+{
+	if (addr)
+		kvfree(addr);
+}
+
+static void cake_destroy(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+
+	qdisc_watchdog_cancel(&q->watchdog);
+
+	if (q->tins)
+		cake_free(q->tins);
+}
+
+static int cake_init(struct Qdisc *sch, struct nlattr *opt,
+		     struct netlink_ext_ack *extack)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	int i, j;
+
+	sch->limit = 10240;
+	q->tin_mode = CAKE_DIFFSERV_DIFFSERV3;
+	q->flow_mode  = CAKE_FLOW_TRIPLE;
+
+	q->rate_bps = 0; /* unlimited by default */
+
+	q->interval = 100000; /* 100ms default */
+	q->target   =   5000; /* 5ms: codel RFC argues
+			       * for 5 to 10% of interval
+			       */
+
+	q->cur_tin = 0;
+	q->cur_flow  = 0;
+
+	if (opt) {
+		int err = cake_change(sch, opt, extack);
+
+		if (err)
+			return err;
+	}
+
+	qdisc_watchdog_init(&q->watchdog, sch);
+
+	quantum_div[0] = ~0;
+	for (i = 1; i <= CAKE_QUEUES; i++)
+		quantum_div[i] = 65535 / i;
+
+	q->tins = kvzalloc(CAKE_MAX_TINS * sizeof(struct cake_tin_data),
+			   GFP_KERNEL | __GFP_NOWARN);
+	if (!q->tins)
+		goto nomem;
+
+	for (i = 0; i < CAKE_MAX_TINS; i++) {
+		struct cake_tin_data *b = q->tins + i;
+
+		b->perturb = prandom_u32();
+		INIT_LIST_HEAD(&b->new_flows);
+		INIT_LIST_HEAD(&b->old_flows);
+		INIT_LIST_HEAD(&b->decaying_flows);
+		b->sparse_flow_count = 0;
+		b->bulk_flow_count = 0;
+		b->decaying_flow_count = 0;
+
+		for (j = 0; j < CAKE_QUEUES; j++) {
+			struct cake_flow *flow = b->flows + j;
+			u32 k = j * CAKE_MAX_TINS + i;
+
+			INIT_LIST_HEAD(&flow->flowchain);
+			cobalt_vars_init(&flow->cvars);
+
+			q->overflow_heap[k].t = i;
+			q->overflow_heap[k].b = j;
+			b->overflow_idx[j] = k;
+		}
+	}
+
+	cake_reconfigure(sch);
+	q->avg_peak_bandwidth = q->rate_bps;
+	q->min_netlen = q->min_adjlen = ~0;
+	return 0;
+
+nomem:
+	cake_destroy(sch);
+	return -ENOMEM;
+}
+
+static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct nlattr *opts;
+
+	opts = nla_nest_start(skb, TCA_OPTIONS);
+	if (!opts)
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_BASE_RATE, q->rate_bps))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_DIFFSERV_MODE, q->tin_mode))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_ATM, q->atm_mode))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_FLOW_MODE,
+			q->flow_mode & ~CAKE_FLOW_NAT_FLAG))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_NAT,
+			!!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_SPLIT_GSO, !!(q->rate_flags & CAKE_FLAG_SPLIT_GSO)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_WASH,
+			!!(q->rate_flags & CAKE_FLAG_WASH)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_OVERHEAD, q->rate_overhead))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_MPU, q->rate_mpu))
+		goto nla_put_failure;
+
+	if (!(q->rate_flags & CAKE_FLAG_OVERHEAD))
+		if (nla_put_u32(skb, TCA_CAKE_RAW, 0))
+			goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_RTT, q->interval))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_TARGET, q->target))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_AUTORATE,
+			!!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_INGRESS,
+			!!(q->rate_flags & CAKE_FLAG_INGRESS)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit))
+		goto nla_put_failure;
+
+	return nla_nest_end(skb, opts);
+
+nla_put_failure:
+	return -1;
+}
+
+static int cake_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	struct nlattr *stats = nla_nest_start(d->skb, TCA_STATS_APP);
+	struct nlattr *tstats, *ts;
+	int i;
+
+	if (!stats)
+		return -1;
+
+#define PUT_STAT_U32(attr, data) do {				       \
+		if(nla_put_u32(d->skb, TCA_CAKE_STATS_ ## attr, data)) \
+			goto nla_put_failure;			       \
+	} while (0);
+
+	PUT_STAT_U32(CAPACITY_ESTIMATE, q->avg_peak_bandwidth);
+	PUT_STAT_U32(MEMORY_LIMIT, q->buffer_limit);
+	PUT_STAT_U32(MEMORY_USED, q->buffer_max_used);
+	PUT_STAT_U32(AVG_NETOFF, ((q->avg_netoff + 0x8000) >> 16));
+	PUT_STAT_U32(MAX_NETLEN, q->max_netlen);
+	PUT_STAT_U32(MAX_ADJLEN, q->max_adjlen);
+	PUT_STAT_U32(MIN_NETLEN, q->min_netlen);
+	PUT_STAT_U32(MIN_ADJLEN, q->min_adjlen);
+
+#undef PUT_STAT_U32
+
+	tstats = nla_nest_start(d->skb, TCA_CAKE_STATS_TIN_STATS);
+	if (!tstats)
+		goto nla_put_failure;
+
+#define PUT_TSTAT_U32(attr, data) do {					\
+		if(nla_put_u32(d->skb, TCA_CAKE_TIN_STATS_ ## attr, data)) \
+			goto nla_put_failure;				\
+	} while (0);
+#define PUT_TSTAT_U64(attr, data) do {					\
+		if(nla_put_u64_64bit(d->skb, TCA_CAKE_TIN_STATS_ ## attr, \
+					data, TCA_CAKE_TIN_STATS_PAD))	\
+			goto nla_put_failure;				\
+	} while (0);
+
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[q->tin_order[i]];
+
+		ts = nla_nest_start(d->skb, i + 1);
+		if (!ts)
+			goto nla_put_failure;
+
+		PUT_TSTAT_U32(THRESHOLD_RATE, b->tin_rate_bps);
+		PUT_TSTAT_U32(TARGET_US, cobalt_time_to_us(b->cparams.target));
+		PUT_TSTAT_U32(INTERVAL_US, cobalt_time_to_us(b->cparams.interval));
+
+		PUT_TSTAT_U32(SENT_PACKETS, b->packets);
+		PUT_TSTAT_U64(SENT_BYTES64, b->bytes);
+		PUT_TSTAT_U32(DROPPED_PACKETS, b->tin_dropped);
+		PUT_TSTAT_U32(ECN_MARKED_PACKETS, b->tin_ecn_mark);
+		PUT_TSTAT_U64(BACKLOG_BYTES64, b->tin_backlog);
+		PUT_TSTAT_U32(ACKS_DROPPED_PACKETS, b->ack_drops);
+
+		PUT_TSTAT_U32(PEAK_DELAY_US, cobalt_time_to_us(b->peak_delay));
+		PUT_TSTAT_U32(AVG_DELAY_US, cobalt_time_to_us(b->avge_delay));
+		PUT_TSTAT_U32(BASE_DELAY_US, cobalt_time_to_us(b->base_delay));
+
+		PUT_TSTAT_U32(WAY_INDIRECT_HITS, b->way_hits);
+		PUT_TSTAT_U32(WAY_MISSES, b->way_misses);
+		PUT_TSTAT_U32(WAY_COLLISIONS, b->way_collisions);
+
+		PUT_TSTAT_U32(SPARSE_FLOWS, b->sparse_flow_count +
+					   b->decaying_flow_count);
+		PUT_TSTAT_U32(BULK_FLOWS, b->bulk_flow_count);
+		PUT_TSTAT_U32(UNRESPONSIVE_FLOWS, b->unresponsive_flow_count);
+		PUT_TSTAT_U32(MAX_SKBLEN, b->max_skblen);
+
+		PUT_TSTAT_U32(FLOW_QUANTUM, b->flow_quantum);
+		nla_nest_end(d->skb, ts);
+	}
+
+#undef PUT_TSTAT_U32
+#undef PUT_TSTAT_U64
+
+	nla_nest_end(d->skb, tstats);
+	return nla_nest_end(d->skb, stats);
+
+nla_put_failure:
+	nla_nest_cancel(d->skb, stats);
+	return -1;
+}
+
+static struct Qdisc_ops cake_qdisc_ops __read_mostly = {
+	.id		=	"cake",
+	.priv_size	=	sizeof(struct cake_sched_data),
+	.enqueue	=	cake_enqueue,
+	.dequeue	=	cake_dequeue,
+	.peek		=	qdisc_peek_dequeued,
+	.init		=	cake_init,
+	.reset		=	cake_reset,
+	.destroy	=	cake_destroy,
+	.change		=	cake_change,
+	.dump		=	cake_dump,
+	.dump_stats	=	cake_dump_stats,
+	.owner		=	THIS_MODULE,
+};
+
+static int __init cake_module_init(void)
+{
+	return register_qdisc(&cake_qdisc_ops);
+}
+
+static void __exit cake_module_exit(void)
+{
+	unregister_qdisc(&cake_qdisc_ops);
+}
+
+module_init(cake_module_init)
+module_exit(cake_module_exit)
+MODULE_AUTHOR("Jonathan Morton");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_DESCRIPTION("The CAKE shaper.");
-- 
2.17.0

^ permalink raw reply related

* [PATCH iproute2-next v5] Add support for cake qdisc
From: Toke Høiland-Jørgensen @ 2018-04-27 12:17 UTC (permalink / raw)
  To: netdev; +Cc: cake, Toke Høiland-Jørgensen, Dave Taht
In-Reply-To: <20180427121706.23273-1-toke@toke.dk>

sch_cake is intended to squeeze the most bandwidth and latency out of even
the slowest ISP links and routers, while presenting an API simple enough
that even an ISP can configure it.

Example of use on a cable ISP uplink:

tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter

To shape a cable download link (ifb and tc-mirred setup elided)

tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash besteffort

Cake is filled with:

* A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
  derived Flow Queuing system, which autoconfigures based on the bandwidth.
* A novel "triple-isolate" mode (the default) which balances per-host
  and per-flow FQ even through NAT.
* An deficit based shaper, that can also be used in an unlimited mode.
* 8 way set associative hashing to reduce flow collisions to a minimum.
* A reasonable interpretation of various diffserv latency/loss tradeoffs.
* Support for zeroing diffserv markings for entering and exiting traffic.
* Support for interacting well with Docsis 3.0 shaper framing.
* Support for DSL framing types and shapers.
* Support for ack filtering.
* Extensive statistics for measuring, loss, ecn markings, latency variation.

Various versions baking have been available as an out of tree build for
kernel versions going back to 3.10, as the embedded router world has been
running a few years behind mainline Linux. A stable version has been
generally available on lede-17.01 and later.

sch_cake replaces a combination of iptables, tc filter, htb and fq_codel
in the sqm-scripts, with sane defaults and vastly simpler configuration.

Cake's principal author is Jonathan Morton, with contributions from
Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen, Sebastian Moeller,
Ryan Mounce, Guido Sarducci, Dean Scarff, Nils Andreas Svee, Dave Täht,
and Loganaden Velvindron.

Testing from Pete Heist, Georgios Amanakis, and the many other members of
the cake@lists.bufferbloat.net mailing list.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
Changelog:
v5:
  - Print the SPLIT_GSO flag
  - Switch to print_u64() for JSON output
  - Fix a format string for mpu option output

v4:
  - Switch stats parsing to use nested netlink attributes
  - Tweaks to JSON stats output keys

v3:
  - Remove accidentally included test flag

v2:
  - Updated netlink config ABI
  - Remove diffserv-llt mode
  - Various tweaks and clean-ups of stats output

 man/man8/tc-cake.8 | 632 ++++++++++++++++++++++++++++++++++++++
 man/man8/tc.8      |   1 +
 tc/Makefile        |   1 +
 tc/q_cake.c        | 739 +++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 1373 insertions(+)
 create mode 100644 man/man8/tc-cake.8
 create mode 100644 tc/q_cake.c

diff --git a/man/man8/tc-cake.8 b/man/man8/tc-cake.8
new file mode 100644
index 00000000..dff2e360
--- /dev/null
+++ b/man/man8/tc-cake.8
@@ -0,0 +1,632 @@
+.TH CAKE 8 "27 April 2018" "iproute2" "Linux"
+.SH NAME
+CAKE \- Common Applications Kept Enhanced (CAKE)
+.SH SYNOPSIS
+.B tc qdisc ... cake
+.br
+[
+.BR bandwidth
+RATE |
+.BR unlimited*
+|
+.BR autorate_ingress
+]
+.br
+[
+.BR rtt
+TIME |
+.BR datacentre
+|
+.BR lan
+|
+.BR metro
+|
+.BR regional
+|
+.BR internet*
+|
+.BR oceanic
+|
+.BR satellite
+|
+.BR interplanetary
+]
+.br
+[
+.BR besteffort
+|
+.BR diffserv8
+|
+.BR diffserv4
+|
+.BR diffserv3*
+]
+.br
+[
+.BR flowblind
+|
+.BR srchost
+|
+.BR dsthost
+|
+.BR hosts
+|
+.BR flows
+|
+.BR dual-srchost
+|
+.BR dual-dsthost
+|
+.BR triple-isolate*
+]
+.br
+[
+.BR nat
+|
+.BR nonat*
+]
+.br
+[
+.BR wash
+|
+.BR nowash*
+]
+.br
+[
+.BR ack-filter
+|
+.BR ack-filter-aggressive
+|
+.BR no-ack-filter*
+]
+.br
+[
+.BR memlimit
+LIMIT ]
+.br
+[
+.BR ptm
+|
+.BR atm
+|
+.BR noatm*
+]
+.br
+[
+.BR overhead
+N |
+.BR conservative
+|
+.BR raw*
+]
+.br
+[
+.BR mpu
+N ]
+.br
+[
+.BR ingress
+|
+.BR egress*
+]
+.br
+(* marks defaults)
+
+
+.SH DESCRIPTION
+CAKE (Common Applications Kept Enhanced) is a shaping-capable queue discipline
+which uses both AQM and FQ.  It combines COBALT, which is an AQM algorithm
+combining Codel and BLUE, a shaper which operates in deficit mode, and a variant
+of DRR++ for flow isolation.  8-way set-associative hashing is used to virtually
+eliminate hash collisions.  Priority queuing is available through a simplified
+diffserv implementation.  Overhead compensation for various encapsulation
+schemes is tightly integrated.
+
+All settings are optional; the default settings are chosen to be sensible in
+most common deployments.  Most people will only need to set the
+.B bandwidth
+parameter to get useful results, but reading the
+.B Overhead Compensation
+and
+.B Round Trip Time
+sections is strongly encouraged.
+
+.SH SHAPER PARAMETERS
+CAKE uses a deficit-mode shaper, which does not exhibit the initial burst
+typical of token-bucket shapers.  It will automatically burst precisely as much
+as required to maintain the configured throughput.  As such, it is very
+straightforward to configure.
+.PP
+.B unlimited
+(default)
+.br
+	No limit on the bandwidth.
+.PP
+.B bandwidth
+RATE
+.br
+	Set the shaper bandwidth.  See
+.BR tc(8)
+or examples below for details of the RATE value.
+.PP
+.B autorate_ingress
+.br
+	Automatic capacity estimation based on traffic arriving at this qdisc.
+This is most likely to be useful with cellular links, which tend to change
+quality randomly.  A
+.B bandwidth
+parameter can be used in conjunction to specify an initial estimate.  The shaper
+will periodically be set to a bandwidth slightly below the estimated rate.  This
+estimator cannot estimate the bandwidth of links downstream of itself.
+
+.SH OVERHEAD COMPENSATION PARAMETERS
+The size of each packet on the wire may differ from that seen by Linux.  The
+following parameters allow CAKE to compensate for this difference by internally
+considering each packet to be bigger than Linux informs it.  To assist users who
+are not expert network engineers, keywords have been provided to represent a
+number of common link technologies.
+
+.SS	Manual Overhead Specification
+.B overhead
+BYTES
+.br
+	Adds BYTES to the size of each packet.  BYTES may be negative; values
+between -64 and 256 (inclusive) are accepted.
+.PP
+.B mpu
+BYTES
+.br
+	Rounds each packet (including overhead) up to a minimum length
+BYTES. BYTES may not be negative; values between 0 and 256 (inclusive)
+are accepted.
+.PP
+.B atm
+.br
+	Compensates for ATM cell framing, which is normally found on ADSL links.
+This is performed after the
+.B overhead
+parameter above.  ATM uses fixed 53-byte cells, each of which can carry 48 bytes
+payload.
+.PP
+.B ptm
+.br
+	Compensates for PTM encoding, which is normally found on VDSL2 links and
+uses a 64b/65b encoding scheme. It is even more efficient to simply
+derate the specified shaper bandwidth by a factor of 64/65 or 0.984. See
+ITU G.992.3 Annex N and IEEE 802.3 Section 61.3 for details.
+.PP
+.B noatm
+.br
+	Disables ATM and PTM compensation.
+
+.SS	Failsafe Overhead Keywords
+These two keywords are provided for quick-and-dirty setup.  Use them if you
+can't be bothered to read the rest of this section.
+.PP
+.B raw
+(default)
+.br
+	Turns off all overhead compensation in CAKE.  The packet size reported
+by Linux will be used directly.
+.PP
+	Other overhead keywords may be added after "raw".  The effect of this is
+to make the overhead compensation operate relative to the reported packet size,
+not the underlying IP packet size.
+.PP
+.B conservative
+.br
+	Compensates for more overhead than is likely to occur on any
+widely-deployed link technology.
+.br
+	Equivalent to
+.B overhead 48 atm.
+
+.SS ADSL Overhead Keywords
+Most ADSL modems have a way to check which framing scheme is in use.  Often this
+is also specified in the settings document provided by the ISP.  The keywords in
+this section are intended to correspond with these sources of information.  All
+of them implicitly set the
+.B atm
+flag.
+.PP
+.B pppoa-vcmux
+.br
+	Equivalent to
+.B overhead 10 atm
+.PP
+.B pppoa-llc
+.br
+	Equivalent to
+.B overhead 14 atm
+.PP
+.B pppoe-vcmux
+.br
+	Equivalent to
+.B overhead 32 atm
+.PP
+.B pppoe-llcsnap
+.br
+	Equivalent to
+.B overhead 40 atm
+.PP
+.B bridged-vcmux
+.br
+	Equivalent to
+.B overhead 24 atm
+.PP
+.B bridged-llcsnap
+.br
+	Equivalent to
+.B overhead 32 atm
+.PP
+.B ipoa-vcmux
+.br
+	Equivalent to
+.B overhead 8 atm
+.PP
+.B ipoa-llcsnap
+.br
+	Equivalent to
+.B overhead 16 atm
+.PP
+See also the Ethernet Correction Factors section below.
+
+.SS VDSL2 Overhead Keywords
+ATM was dropped from VDSL2 in favour of PTM, which is a much more
+straightforward framing scheme.  Some ISPs retained PPPoE for compatibility with
+their existing back-end systems.
+.PP
+.B pppoe-ptm
+.br
+	Equivalent to
+.B overhead 30 ptm
+
+.br
+	PPPoE: 2B PPP + 6B PPPoE +
+.br
+	ETHERNET: 6B dest MAC + 6B src MAC + 2B ethertype + 4B Frame Check Sequence +
+.br
+	PTM: 1B Start of Frame (S) + 1B End of Frame (Ck) + 2B TC-CRC (PTM-FCS)
+.br
+.PP
+.B bridged-ptm
+.br
+	Equivalent to
+.B overhead 22 ptm
+.br
+	ETHERNET: 6B dest MAC + 6B src MAC + 2B ethertype + 4B Frame Check Sequence +
+.br
+	PTM: 1B Start of Frame (S) + 1B End of Frame (Ck) + 2B TC-CRC (PTM-FCS)
+.br
+.PP
+See also the Ethernet Correction Factors section below.
+
+.SS DOCSIS Cable Overhead Keyword
+DOCSIS is the universal standard for providing Internet service over cable-TV
+infrastructure.
+
+In this case, the actual on-wire overhead is less important than the packet size
+the head-end equipment uses for shaping and metering.  This is specified to be
+an Ethernet frame including the CRC (aka FCS).
+.PP
+.B docsis
+.br
+	Equivalent to
+.B overhead 18 mpu 64 noatm
+
+.SS Ethernet Overhead Keywords
+.PP
+.B ethernet
+.br
+	Accounts for Ethernet's preamble, inter-frame gap, and Frame Check
+Sequence.  Use this keyword when the bottleneck being shaped for is an
+actual Ethernet cable.
+.br
+	Equivalent to
+.B overhead 38 mpu 84 noatm
+.PP
+.B ether-vlan
+.br
+	Adds 4 bytes to the overhead compensation, accounting for an IEEE 802.1Q
+VLAN header appended to the Ethernet frame header.  NB: Some ISPs use one or
+even two of these within PPPoE; this keyword may be repeated as necessary to
+express this.
+
+.SH ROUND TRIP TIME PARAMETERS
+Active Queue Management (AQM) consists of embedding congestion signals in the
+packet flow, which receivers use to instruct senders to slow down when the queue
+is persistently occupied.  CAKE uses ECN signalling when available, and packet
+drops otherwise, according to a combination of the Codel and BLUE AQM algorithms
+called COBALT.
+
+Very short latencies require a very rapid AQM response to adequately control
+latency.  However, such a rapid response tends to impair throughput when the
+actual RTT is relatively long.  CAKE allows specifying the RTT it assumes for
+tuning various parameters.  Actual RTTs within an order of magnitude of this
+will generally work well for both throughput and latency management.
+
+At the 'lan' setting and below, the time constants are similar in magnitude to
+the jitter in the Linux kernel itself, so congestion might be signalled
+prematurely. The flows will then become sparse and total throughput reduced,
+leaving little or no back-pressure for the fairness logic to work against. Use
+the "metro" setting for local lans unless you have a custom kernel.
+.PP
+.B rtt
+TIME
+.br
+	Manually specify an RTT.
+.PP
+.B datacentre
+.br
+	For extremely high-performance 10GigE+ networks only.  Equivalent to
+.B rtt 100us.
+.PP
+.B lan
+.br
+	For pure Ethernet (not Wi-Fi) networks, at home or in the office.  Don't
+use this when shaping for an Internet access link.  Equivalent to
+.B rtt 1ms.
+.PP
+.B metro
+.br
+	For traffic mostly within a single city.  Equivalent to
+.B rtt 10ms.
+.PP
+.B regional
+.br
+	For traffic mostly within a European-sized country.  Equivalent to
+.B rtt 30ms.
+.PP
+.B internet
+(default)
+.br
+	This is suitable for most Internet traffic.  Equivalent to
+.B rtt 100ms.
+.PP
+.B oceanic
+.br
+	For Internet traffic with generally above-average latency, such as that
+suffered by Australasian residents.  Equivalent to
+.B rtt 300ms.
+.PP
+.B satellite
+.br
+	For traffic via geostationary satellites.  Equivalent to
+.B rtt 1000ms.
+.PP
+.B interplanetary
+.br
+	So named because Jupiter is about 1 light-hour from Earth.  Use this to
+(almost) completely disable AQM actions.  Equivalent to
+.B rtt 3600s.
+
+.SH FLOW ISOLATION PARAMETERS
+With flow isolation enabled, CAKE places packets from different flows into
+different queues, each of which carries its own AQM state.  Packets from each
+queue are then delivered fairly, according to a DRR++ algorithm which minimises
+latency for "sparse" flows.  CAKE uses a set-associative hashing algorithm to
+minimise flow collisions.
+
+These keywords specify whether fairness based on source address, destination
+address, individual flows, or any combination of those is desired.
+.PP
+.B flowblind
+.br
+	Disables flow isolation; all traffic passes through a single queue for
+each tin.
+.PP
+.B srchost
+.br
+	Flows are defined only by source address.  Could be useful on the egress
+path of an ISP backhaul.
+.PP
+.B dsthost
+.br
+	Flows are defined only by destination address.  Could be useful on the
+ingress path of an ISP backhaul.
+.PP
+.B hosts
+.br
+	Flows are defined by source-destination host pairs.  This is host
+isolation, rather than flow isolation.
+.PP
+.B flows
+.br
+	Flows are defined by the entire 5-tuple of source address, destination
+address, transport protocol, source port and destination port.  This is the type
+of flow isolation performed by SFQ and fq_codel.
+.PP
+.B dual-srchost
+.br
+	Flows are defined by the 5-tuple, and fairness is applied first over
+source addresses, then over individual flows.  Good for use on egress traffic
+from a LAN to the internet, where it'll prevent any one LAN host from
+monopolising the uplink, regardless of the number of flows they use.
+.PP
+.B dual-dsthost
+.br
+	Flows are defined by the 5-tuple, and fairness is applied first over
+destination addresses, then over individual flows.  Good for use on ingress
+traffic to a LAN from the internet, where it'll prevent any one LAN host from
+monopolising the downlink, regardless of the number of flows they use.
+.PP
+.B triple-isolate
+(default)
+.br
+	Flows are defined by the 5-tuple, and fairness is applied over source
+*and* destination addresses intelligently (ie. not merely by host-pairs), and
+also over individual flows.  Use this if you're not certain whether to use
+dual-srchost or dual-dsthost; it'll do both jobs at once, preventing any one
+host on *either* side of the link from monopolising it with a large number of
+flows.
+.PP
+.B nat
+.br
+	Instructs Cake to perform a NAT lookup before applying flow-isolation
+rules, to determine the true addresses and port numbers of the packet, to
+improve fairness between hosts "inside" the NAT.  This has no practical effect
+in "flowblind" or "flows" modes, or if NAT is performed on a different host.
+.PP
+.B nonat
+(default)
+.br
+	Cake will not perform a NAT lookup.  Flow isolation will be performed
+using the addresses and port numbers directly visible to the interface Cake is
+attached to.
+
+.SH PRIORITY QUEUE PARAMETERS
+CAKE can divide traffic into "tins" based on the Diffserv field.  Each tin has
+its own independent set of flow-isolation queues, and is serviced based on a WRR
+algorithm.  To avoid perverse Diffserv marking incentives, tin weights have a
+"priority sharing" value when bandwidth used by that tin is below a threshold,
+and a lower "bandwidth sharing" value when above.  Bandwidth is compared against
+the threshold using the same algorithm as the deficit-mode shaper.
+
+Detailed customisation of tin parameters is not provided.  The following presets
+perform all necessary tuning, relative to the current shaper bandwidth and RTT
+settings.
+.PP
+.B besteffort
+.br
+	Disables priority queuing by placing all traffic in one tin.
+.PP
+.B precedence
+.br
+	Enables legacy interpretation of TOS "Precedence" field.  Use of this
+preset on the modern Internet is firmly discouraged.
+.PP
+.B diffserv4
+.br
+	Provides a general-purpose Diffserv implementation with four tins:
+.br
+		Bulk (CS1), 6.25% threshold, generally low priority.
+.br
+		Best Effort (general), 100% threshold.
+.br
+		Video (AF4x, AF3x, CS3, AF2x, CS2, TOS4, TOS1), 50% threshold.
+.br
+		Voice (CS7, CS6, EF, VA, CS5, CS4), 25% threshold.
+.PP
+.B diffserv3
+(default)
+.br
+	Provides a simple, general-purpose Diffserv implementation with three tins:
+.br
+		Bulk (CS1), 6.25% threshold, generally low priority.
+.br
+		Best Effort (general), 100% threshold.
+.br
+		Voice (CS7, CS6, EF, VA, TOS4), 25% threshold, reduced Codel interval.
+
+.SH OTHER PARAMETERS
+.B memlimit
+LIMIT
+.br
+	Limit the memory consumed by Cake to LIMIT bytes. Note that this does
+not translate directly to queue size (so do not size this based on bandwidth
+delay product considerations, but rather on worst case acceptable memory
+consumption), as there is some overhead in the data structures containing the
+packets, especially for small packets.
+
+	By default, the limit is calculated based on the bandwidth and RTT
+settings.
+
+.PP
+.B wash
+
+.br
+	Traffic entering your diffserv domain is frequently mis-marked in
+transit from the perspective of your network, and traffic exiting yours may be
+mis-marked from the perspective of the transiting provider.
+
+Apply the wash option to clear all extra diffserv (but not ECN bits), after
+priority queuing has taken place.
+
+If you are shaping inbound, and cannot trust the diffserv markings (as is the
+case for Comcast Cable, among others), it is best to use a single queue
+"besteffort" mode with wash.
+
+.SH EXAMPLES
+# tc qdisc delete root dev eth0
+.br
+# tc qdisc add root dev eth0 cake bandwidth 100Mbit ethernet
+.br
+# tc -s qdisc show dev eth0
+.br
+qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms noatm overhead 38 mpu 84 
+ Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
+ backlog 0b 0p requeues 0
+ memory used: 0b of 5000000b
+ capacity estimate: 100Mbit
+ min/max network layer size:        65535 /       0
+ min/max overhead-adjusted size:    65535 /       0
+ average network hdr offset:            0
+
+                   Bulk  Best Effort        Voice
+  thresh       6250Kbit      100Mbit       25Mbit
+  target          5.0ms        5.0ms        5.0ms
+  interval      100.0ms      100.0ms      100.0ms
+  pk_delay          0us          0us          0us
+  av_delay          0us          0us          0us
+  sp_delay          0us          0us          0us
+  pkts                0            0            0
+  bytes               0            0            0
+  way_inds            0            0            0
+  way_miss            0            0            0
+  way_cols            0            0            0
+  drops               0            0            0
+  marks               0            0            0
+  ack_drop            0            0            0
+  sp_flows            0            0            0
+  bk_flows            0            0            0
+  un_flows            0            0            0
+  max_len             0            0            0
+  quantum           300         1514          762
+
+After some use:
+.br
+# tc -s qdisc show dev eth0
+
+qdisc cake 1: root refcnt 2 bandwidth 100Mbit diffserv3 triple-isolate rtt 100.0ms noatm overhead 38 mpu 84 
+ Sent 44709231 bytes 31931 pkt (dropped 45, overlimits 93782 requeues 0) 
+ backlog 33308b 22p requeues 0
+ memory used: 292352b of 5000000b
+ capacity estimate: 100Mbit
+ min/max network layer size:           28 /    1500
+ min/max overhead-adjusted size:       84 /    1538
+ average network hdr offset:           14
+
+                   Bulk  Best Effort        Voice
+  thresh       6250Kbit      100Mbit       25Mbit
+  target          5.0ms        5.0ms        5.0ms
+  interval      100.0ms      100.0ms      100.0ms
+  pk_delay        8.7ms        6.9ms        5.0ms
+  av_delay        4.9ms        5.3ms        3.8ms
+  sp_delay        727us        1.4ms        511us
+  pkts             2590        21271         8137
+  bytes         3081804     30302659     11426206
+  way_inds            0           46            0
+  way_miss            3           17            4
+  way_cols            0            0            0
+  drops              20           15           10
+  marks               0            0            0
+  ack_drop            0            0            0
+  sp_flows            2            4            1
+  bk_flows            1            2            1
+  un_flows            0            0            0
+  max_len          1514         1514         1514
+  quantum           300         1514          762
+
+.SH SEE ALSO
+.BR tc (8),
+.BR tc-codel (8),
+.BR tc-fq_codel (8),
+.BR tc-red (8)
+
+.SH AUTHORS
+Cake's principal author is Jonathan Morton, with contributions from
+Tony Ambardar, Kevin Darbyshire-Bryant, Toke Høiland-Jørgensen,
+Sebastian Moeller, Ryan Mounce, Dean Scarff, Nils Andreas Svee, and Dave Täht.
+
+This manual page was written by Loganaden Velvindron. Please report corrections
+to the Linux Networking mailing list <netdev@vger.kernel.org>.
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 840880fb..716dfec5 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -795,6 +795,7 @@ was written by Alexey N. Kuznetsov and added in Linux 2.2.
 .BR tc-basic (8),
 .BR tc-bfifo (8),
 .BR tc-bpf (8),
+.BR tc-cake (8),
 .BR tc-cbq (8),
 .BR tc-cgroup (8),
 .BR tc-choke (8),
diff --git a/tc/Makefile b/tc/Makefile
index dfd00267..d9a43568 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -66,6 +66,7 @@ TCMODULES += q_codel.o
 TCMODULES += q_fq_codel.o
 TCMODULES += q_fq.o
 TCMODULES += q_pie.o
+TCMODULES += q_cake.o
 TCMODULES += q_hhf.o
 TCMODULES += q_clsact.o
 TCMODULES += e_bpf.o
diff --git a/tc/q_cake.c b/tc/q_cake.c
new file mode 100644
index 00000000..a1e66f4e
--- /dev/null
+++ b/tc/q_cake.c
@@ -0,0 +1,739 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Common Applications Kept Enhanced  --  CAKE
+ *
+ *  Copyright (C) 2014-2018 Jonathan Morton <chromatix99@gmail.com>
+ *  Copyright (C) 2017-2018 Toke Høiland-Jørgensen <toke@toke.dk>
+ */
+
+#include <stddef.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <fcntl.h>
+#include <sys/socket.h>
+#include <netinet/in.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <inttypes.h>
+
+#include "utils.h"
+#include "tc_util.h"
+
+static void explain(void)
+{
+	fprintf(stderr,
+"Usage: ... cake [ bandwidth RATE | unlimited* | autorate_ingress ]\n"
+"                [ rtt TIME | datacentre | lan | metro | regional |\n"
+"                  internet* | oceanic | satellite | interplanetary ]\n"
+"                [ besteffort | diffserv8 | diffserv4 | diffserv3* ]\n"
+"                [ flowblind | srchost | dsthost | hosts | flows |\n"
+"                  dual-srchost | dual-dsthost | triple-isolate* ]\n"
+"                [ nat | nonat* ]\n"
+"                [ wash | nowash* ]\n"
+"                [ ack-filter | ack-filter-aggressive | no-ack-filter* ]\n"
+"                [ memlimit LIMIT ]\n"
+"                [ ptm | atm | noatm* ] [ overhead N | conservative | raw* ]\n"
+"                [ mpu N ] [ ingress | egress* ]\n"
+"                (* marks defaults)\n");
+}
+
+static int cake_parse_opt(struct qdisc_util *qu, int argc, char **argv,
+			  struct nlmsghdr *n, const char *dev)
+{
+	int unlimited = 0;
+	unsigned bandwidth = 0;
+	unsigned interval = 0;
+	unsigned target = 0;
+	unsigned diffserv = 0;
+	unsigned memlimit = 0;
+	int  overhead = 0;
+	bool overhead_set = false;
+	bool overhead_override = false;
+	int mpu = 0;
+	int flowmode = -1;
+	int nat = -1;
+	int atm = -1;
+	int autorate = -1;
+	int wash = -1;
+	int ingress = -1;
+	int ack_filter = -1;
+	struct rtattr *tail;
+
+	while (argc > 0) {
+		if (strcmp(*argv, "bandwidth") == 0) {
+			NEXT_ARG();
+			if (get_rate(&bandwidth, *argv)) {
+				fprintf(stderr, "Illegal \"bandwidth\"\n");
+				return -1;
+			}
+			unlimited = 0;
+			autorate = 0;
+		} else if (strcmp(*argv, "unlimited") == 0) {
+			bandwidth = 0;
+			unlimited = 1;
+			autorate = 0;
+		} else if (strcmp(*argv, "autorate_ingress") == 0) {
+			autorate = 1;
+
+		} else if (strcmp(*argv, "rtt") == 0) {
+			NEXT_ARG();
+			if (get_time(&interval, *argv)) {
+				fprintf(stderr, "Illegal \"rtt\"\n");
+				return -1;
+			}
+			target = interval / 20;
+			if(!target)
+				target = 1;
+		} else if (strcmp(*argv, "datacentre") == 0) {
+			interval = 100;
+			target   =   5;
+		} else if (strcmp(*argv, "lan") == 0) {
+			interval = 1000;
+			target   =   50;
+		} else if (strcmp(*argv, "metro") == 0) {
+			interval = 10000;
+			target   =   500;
+		} else if (strcmp(*argv, "regional") == 0) {
+			interval = 30000;
+			target    = 1500;
+		} else if (strcmp(*argv, "internet") == 0) {
+			interval = 100000;
+			target   =   5000;
+		} else if (strcmp(*argv, "oceanic") == 0) {
+			interval = 300000;
+			target   =  15000;
+		} else if (strcmp(*argv, "satellite") == 0) {
+			interval = 1000000;
+			target   =   50000;
+		} else if (strcmp(*argv, "interplanetary") == 0) {
+			interval = 1000000000;
+			target   =   50000000;
+
+		} else if (strcmp(*argv, "besteffort") == 0) {
+			diffserv = CAKE_DIFFSERV_BESTEFFORT;
+		} else if (strcmp(*argv, "precedence") == 0) {
+			diffserv = CAKE_DIFFSERV_PRECEDENCE;
+		} else if (strcmp(*argv, "diffserv8") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV8;
+		} else if (strcmp(*argv, "diffserv4") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV4;
+		} else if (strcmp(*argv, "diffserv") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV4;
+		} else if (strcmp(*argv, "diffserv3") == 0) {
+			diffserv = CAKE_DIFFSERV_DIFFSERV3;
+
+		} else if (strcmp(*argv, "nowash") == 0) {
+			wash = 0;
+		} else if (strcmp(*argv, "wash") == 0) {
+			wash = 1;
+
+		} else if (strcmp(*argv, "flowblind") == 0) {
+			flowmode = CAKE_FLOW_NONE;
+		} else if (strcmp(*argv, "srchost") == 0) {
+			flowmode = CAKE_FLOW_SRC_IP;
+		} else if (strcmp(*argv, "dsthost") == 0) {
+			flowmode = CAKE_FLOW_DST_IP;
+		} else if (strcmp(*argv, "hosts") == 0) {
+			flowmode = CAKE_FLOW_HOSTS;
+		} else if (strcmp(*argv, "flows") == 0) {
+			flowmode = CAKE_FLOW_FLOWS;
+		} else if (strcmp(*argv, "dual-srchost") == 0) {
+			flowmode = CAKE_FLOW_DUAL_SRC;
+		} else if (strcmp(*argv, "dual-dsthost") == 0) {
+			flowmode = CAKE_FLOW_DUAL_DST;
+		} else if (strcmp(*argv, "triple-isolate") == 0) {
+			flowmode = CAKE_FLOW_TRIPLE;
+
+		} else if (strcmp(*argv, "nat") == 0) {
+			nat = 1;
+		} else if (strcmp(*argv, "nonat") == 0) {
+			nat = 0;
+
+		} else if (strcmp(*argv, "ptm") == 0) {
+			atm = CAKE_ATM_PTM;
+		} else if (strcmp(*argv, "atm") == 0) {
+			atm = CAKE_ATM_ATM;
+		} else if (strcmp(*argv, "noatm") == 0) {
+			atm = CAKE_ATM_NONE;
+
+		} else if (strcmp(*argv, "raw") == 0) {
+			atm = CAKE_ATM_NONE;
+			overhead = 0;
+			overhead_set = true;
+			overhead_override = true;
+		} else if (strcmp(*argv, "conservative") == 0) {
+			/*
+			 * Deliberately over-estimate overhead:
+			 * one whole ATM cell plus ATM framing.
+			 * A safe choice if the actual overhead is unknown.
+			 */
+			atm = CAKE_ATM_ATM;
+			overhead = 48;
+			overhead_set = true;
+
+		/* Various ADSL framing schemes, all over ATM cells */
+		} else if (strcmp(*argv, "ipoa-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 8;
+			overhead_set = true;
+		} else if (strcmp(*argv, "ipoa-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 16;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 24;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 32;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoa-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 10;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoa-llc") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 14;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoe-vcmux") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 32;
+			overhead_set = true;
+		} else if (strcmp(*argv, "pppoe-llcsnap") == 0) {
+			atm = CAKE_ATM_ATM;
+			overhead += 40;
+			overhead_set = true;
+
+		/* Typical VDSL2 framing schemes, both over PTM */
+		/* PTM has 64b/65b coding which absorbs some bandwidth */
+		} else if (strcmp(*argv, "pppoe-ptm") == 0) {
+			/* 2B PPP + 6B PPPoE + 6B dest MAC + 6B src MAC
+			 * + 2B ethertype + 4B Frame Check Sequence
+			 * + 1B Start of Frame (S) + 1B End of Frame (Ck)
+			 * + 2B TC-CRC (PTM-FCS) = 30B
+			 */
+			atm = CAKE_ATM_PTM;
+			overhead += 30;
+			overhead_set = true;
+		} else if (strcmp(*argv, "bridged-ptm") == 0) {
+			/* 6B dest MAC + 6B src MAC + 2B ethertype
+			 * + 4B Frame Check Sequence
+			 * + 1B Start of Frame (S) + 1B End of Frame (Ck)
+			 * + 2B TC-CRC (PTM-FCS) = 22B
+			 */
+			atm = CAKE_ATM_PTM;
+			overhead += 22;
+			overhead_set = true;
+
+		} else if (strcmp(*argv, "via-ethernet") == 0) {
+			/*
+			 * We used to use this flag to manually compensate for
+			 * Linux including the Ethernet header on Ethernet-type
+			 * interfaces, but not on IP-type interfaces.
+			 *
+			 * It is no longer needed, because Cake now adjusts for
+			 * that automatically, and is thus ignored.
+			 *
+			 * It would be deleted entirely, but it appears in the
+			 * stats output when the automatic compensation is
+			 * active.
+			 */
+
+		} else if (strcmp(*argv, "ethernet") == 0) {
+			/* ethernet pre-amble & interframe gap & FCS
+			 * you may need to add vlan tag */
+			overhead += 38;
+			overhead_set = true;
+			mpu = 84;
+
+		/* Additional Ethernet-related overhead used by some ISPs */
+		} else if (strcmp(*argv, "ether-vlan") == 0) {
+			/* 802.1q VLAN tag - may be repeated */
+			overhead += 4;
+			overhead_set = true;
+
+		/*
+		 * DOCSIS cable shapers account for Ethernet frame with FCS,
+		 * but not interframe gap or preamble.
+		 */
+		} else if (strcmp(*argv, "docsis") == 0) {
+			atm = CAKE_ATM_NONE;
+			overhead += 18;
+			overhead_set = true;
+			mpu = 64;
+
+		} else if (strcmp(*argv, "overhead") == 0) {
+			char* p = NULL;
+			NEXT_ARG();
+			overhead = strtol(*argv, &p, 10);
+			if(!p || *p || !*argv || overhead < -64 || overhead > 256) {
+				fprintf(stderr, "Illegal \"overhead\", valid range is -64 to 256\\n");
+				return -1;
+			}
+			overhead_set = true;
+
+		} else if (strcmp(*argv, "mpu") == 0) {
+			char* p = NULL;
+			NEXT_ARG();
+			mpu = strtol(*argv, &p, 10);
+			if(!p || *p || !*argv || mpu < 0 || mpu > 256) {
+				fprintf(stderr, "Illegal \"mpu\", valid range is 0 to 256\\n");
+				return -1;
+			}
+
+		} else if (strcmp(*argv, "ingress") == 0) {
+			ingress = 1;
+		} else if (strcmp(*argv, "egress") == 0) {
+			ingress = 0;
+
+		} else if (strcmp(*argv, "no-ack-filter") == 0) {
+			ack_filter = CAKE_ACK_NONE;
+		} else if (strcmp(*argv, "ack-filter") == 0) {
+			ack_filter = CAKE_ACK_FILTER;
+		} else if (strcmp(*argv, "ack-filter-aggressive") == 0) {
+			ack_filter = CAKE_ACK_AGGRESSIVE;
+
+		} else if (strcmp(*argv, "memlimit") == 0) {
+			NEXT_ARG();
+			if(get_size(&memlimit, *argv)) {
+				fprintf(stderr, "Illegal value for \"memlimit\": \"%s\"\n", *argv);
+				return -1;
+			}
+
+		} else if (strcmp(*argv, "help") == 0) {
+			explain();
+			return -1;
+		} else {
+			fprintf(stderr, "What is \"%s\"?\n", *argv);
+			explain();
+			return -1;
+		}
+		argc--; argv++;
+	}
+
+	tail = NLMSG_TAIL(n);
+	addattr_l(n, 1024, TCA_OPTIONS, NULL, 0);
+	if (bandwidth || unlimited)
+		addattr_l(n, 1024, TCA_CAKE_BASE_RATE, &bandwidth, sizeof(bandwidth));
+	if (diffserv)
+		addattr_l(n, 1024, TCA_CAKE_DIFFSERV_MODE, &diffserv, sizeof(diffserv));
+	if (atm != -1)
+		addattr_l(n, 1024, TCA_CAKE_ATM, &atm, sizeof(atm));
+	if (flowmode != -1)
+		addattr_l(n, 1024, TCA_CAKE_FLOW_MODE, &flowmode, sizeof(flowmode));
+	if (overhead_set)
+		addattr_l(n, 1024, TCA_CAKE_OVERHEAD, &overhead, sizeof(overhead));
+	if (overhead_override) {
+		unsigned zero = 0;
+		addattr_l(n, 1024, TCA_CAKE_RAW, &zero, sizeof(zero));
+	}
+	if (mpu > 0)
+		addattr_l(n, 1024, TCA_CAKE_MPU, &mpu, sizeof(mpu));
+	if (interval)
+		addattr_l(n, 1024, TCA_CAKE_RTT, &interval, sizeof(interval));
+	if (target)
+		addattr_l(n, 1024, TCA_CAKE_TARGET, &target, sizeof(target));
+	if (autorate != -1)
+		addattr_l(n, 1024, TCA_CAKE_AUTORATE, &autorate, sizeof(autorate));
+	if (memlimit)
+		addattr_l(n, 1024, TCA_CAKE_MEMORY, &memlimit, sizeof(memlimit));
+	if (nat != -1)
+		addattr_l(n, 1024, TCA_CAKE_NAT, &nat, sizeof(nat));
+	if (wash != -1)
+		addattr_l(n, 1024, TCA_CAKE_WASH, &wash, sizeof(wash));
+	if (ingress != -1)
+		addattr_l(n, 1024, TCA_CAKE_INGRESS, &ingress, sizeof(ingress));
+	if (ack_filter != -1)
+		addattr_l(n, 1024, TCA_CAKE_ACK_FILTER, &ack_filter, sizeof(ack_filter));
+
+	tail->rta_len = (void *) NLMSG_TAIL(n) - (void *) tail;
+	return 0;
+}
+
+
+static int cake_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
+{
+	struct rtattr *tb[TCA_CAKE_MAX + 1];
+	unsigned bandwidth = 0;
+	unsigned diffserv = 0;
+	unsigned flowmode = 0;
+	unsigned interval = 0;
+	unsigned memlimit = 0;
+	int overhead = 0;
+	int raw = 0;
+	int mpu = 0;
+	int atm = 0;
+	int nat = 0;
+	int autorate = 0;
+	int wash = 0;
+	int ingress = 0;
+	int ack_filter = 0;
+	int split_gso = 0;
+	SPRINT_BUF(b1);
+	SPRINT_BUF(b2);
+
+	if (opt == NULL)
+		return 0;
+
+	parse_rtattr_nested(tb, TCA_CAKE_MAX, opt);
+
+	if (tb[TCA_CAKE_BASE_RATE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_BASE_RATE]) >= sizeof(__u32)) {
+		bandwidth = rta_getattr_u32(tb[TCA_CAKE_BASE_RATE]);
+		if(bandwidth) {
+			print_uint(PRINT_JSON, "bandwidth", NULL, bandwidth);
+			print_string(PRINT_FP, NULL, "bandwidth %s ", sprint_rate(bandwidth, b1));
+		} else
+			print_string(PRINT_ANY, "bandwidth", "bandwidth %s ", "unlimited");
+	}
+	if (tb[TCA_CAKE_AUTORATE] &&
+		RTA_PAYLOAD(tb[TCA_CAKE_AUTORATE]) >= sizeof(__u32)) {
+		autorate = rta_getattr_u32(tb[TCA_CAKE_AUTORATE]);
+		if(autorate == 1)
+			print_string(PRINT_ANY, "autorate", "autorate_%s ", "ingress");
+		else if(autorate)
+			print_string(PRINT_ANY, "autorate", "(?autorate?) ", "unknown");
+	}
+	if (tb[TCA_CAKE_DIFFSERV_MODE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_DIFFSERV_MODE]) >= sizeof(__u32)) {
+		diffserv = rta_getattr_u32(tb[TCA_CAKE_DIFFSERV_MODE]);
+		switch(diffserv) {
+		case CAKE_DIFFSERV_DIFFSERV3:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv3");
+			break;
+		case CAKE_DIFFSERV_DIFFSERV4:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv4");
+			break;
+		case CAKE_DIFFSERV_DIFFSERV8:
+			print_string(PRINT_ANY, "diffserv", "%s ", "diffserv8");
+			break;
+		case CAKE_DIFFSERV_BESTEFFORT:
+			print_string(PRINT_ANY, "diffserv", "%s ", "besteffort");
+			break;
+		case CAKE_DIFFSERV_PRECEDENCE:
+			print_string(PRINT_ANY, "diffserv", "%s ", "precedence");
+			break;
+		default:
+			print_string(PRINT_ANY, "diffserv", "(?diffserv?) ", "unknown");
+			break;
+		};
+	}
+	if (tb[TCA_CAKE_FLOW_MODE] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_FLOW_MODE]) >= sizeof(__u32)) {
+		flowmode = rta_getattr_u32(tb[TCA_CAKE_FLOW_MODE]);
+		switch(flowmode) {
+		case CAKE_FLOW_NONE:
+			print_string(PRINT_ANY, "flowmode", "%s ", "flowblind");
+			break;
+		case CAKE_FLOW_SRC_IP:
+			print_string(PRINT_ANY, "flowmode", "%s ", "srchost");
+			break;
+		case CAKE_FLOW_DST_IP:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dsthost");
+			break;
+		case CAKE_FLOW_HOSTS:
+			print_string(PRINT_ANY, "flowmode", "%s ", "hosts");
+			break;
+		case CAKE_FLOW_FLOWS:
+			print_string(PRINT_ANY, "flowmode", "%s ", "flows");
+			break;
+		case CAKE_FLOW_DUAL_SRC:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dual-srchost");
+			break;
+		case CAKE_FLOW_DUAL_DST:
+			print_string(PRINT_ANY, "flowmode", "%s ", "dual-dsthost");
+			break;
+		case CAKE_FLOW_TRIPLE:
+			print_string(PRINT_ANY, "flowmode", "%s ", "triple-isolate");
+			break;
+		default:
+			print_string(PRINT_ANY, "flowmode", "(?flowmode?) ", "unknown");
+			break;
+		};
+
+	}
+
+	if (tb[TCA_CAKE_NAT] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_NAT]) >= sizeof(__u32)) {
+	    nat = rta_getattr_u32(tb[TCA_CAKE_NAT]);
+	}
+
+	if(nat)
+		print_string(PRINT_FP, NULL, "nat ", NULL);
+	print_bool(PRINT_JSON, "nat", NULL, nat);
+
+	if (tb[TCA_CAKE_WASH] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_WASH]) >= sizeof(__u32)) {
+		wash = rta_getattr_u32(tb[TCA_CAKE_WASH]);
+	}
+	if (tb[TCA_CAKE_ATM] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_ATM]) >= sizeof(__u32)) {
+		atm = rta_getattr_u32(tb[TCA_CAKE_ATM]);
+	}
+	if (tb[TCA_CAKE_OVERHEAD] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_OVERHEAD]) >= sizeof(__s32)) {
+		overhead = *(__s32 *) RTA_DATA(tb[TCA_CAKE_OVERHEAD]);
+	}
+	if (tb[TCA_CAKE_MPU] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_MPU]) >= sizeof(__u32)) {
+		mpu = rta_getattr_u32(tb[TCA_CAKE_MPU]);
+	}
+	if (tb[TCA_CAKE_INGRESS] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_INGRESS]) >= sizeof(__u32)) {
+		ingress = rta_getattr_u32(tb[TCA_CAKE_INGRESS]);
+	}
+	if (tb[TCA_CAKE_ACK_FILTER] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_ACK_FILTER]) >= sizeof(__u32)) {
+		ack_filter = rta_getattr_u32(tb[TCA_CAKE_ACK_FILTER]);
+	}
+	if (tb[TCA_CAKE_SPLIT_GSO] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_SPLIT_GSO]) >= sizeof(__u32)) {
+		split_gso = rta_getattr_u32(tb[TCA_CAKE_SPLIT_GSO]);
+	}
+	if (tb[TCA_CAKE_RAW]) {
+		raw = 1;
+	}
+	if (tb[TCA_CAKE_RTT] &&
+	    RTA_PAYLOAD(tb[TCA_CAKE_RTT]) >= sizeof(__u32)) {
+		interval = rta_getattr_u32(tb[TCA_CAKE_RTT]);
+	}
+
+	if (wash)
+		print_string(PRINT_FP, NULL, "wash ", NULL);
+	print_bool(PRINT_JSON, "wash", NULL, wash);
+
+	if (ingress)
+		print_string(PRINT_FP, NULL, "ingress ", NULL);
+	print_bool(PRINT_JSON, "ingress", NULL, ingress);
+
+	if (ack_filter == CAKE_ACK_AGGRESSIVE)
+		print_string(PRINT_ANY, "ack-filter", "ack-filter-%s ", "aggressive");
+	else if (ack_filter == CAKE_ACK_FILTER)
+		print_string(PRINT_ANY, "ack-filter", "ack-filter ", "enabled");
+	else
+		print_string(PRINT_JSON, "ack-filter", NULL, "disabled");
+
+	if (split_gso)
+		print_string(PRINT_FP, NULL, "split-gso ", NULL);
+	print_bool(PRINT_JSON, "split_gso", NULL, split_gso);
+
+	if (interval)
+		print_string(PRINT_FP, NULL, "rtt %s ", sprint_time(interval, b2));
+	print_uint(PRINT_JSON, "rtt", NULL, interval);
+
+	if (raw)
+		print_string(PRINT_FP, NULL, "raw ", NULL);
+	print_bool(PRINT_JSON, "raw", NULL, raw);
+
+	if (atm == CAKE_ATM_ATM)
+		print_string(PRINT_ANY, "atm", "%s ", "atm");
+	else if (atm == CAKE_ATM_PTM)
+		print_string(PRINT_ANY, "atm", "%s ", "ptm");
+	else if (!raw)
+		print_string(PRINT_ANY, "atm", "%s ", "noatm");
+
+	print_int(PRINT_ANY, "overhead", "overhead %d ", overhead);
+
+	if (mpu)
+		print_uint(PRINT_ANY, "mpu", "mpu %u ", mpu);
+
+	if (memlimit) {
+		print_uint(PRINT_JSON, "memlimit", NULL, memlimit);
+		print_string(PRINT_FP, NULL, "memlimit %s", sprint_size(memlimit, b1));
+	}
+
+	return 0;
+}
+
+static void cake_print_json_tin(struct rtattr **tstat)
+{
+#define PRINT_TSTAT_JSON(type, name, attr) if (tstat[TCA_CAKE_TIN_STATS_ ## attr]) \
+		print_u64(PRINT_JSON, name, NULL,			\
+			rta_getattr_ ## type((struct rtattr *)tstat[TCA_CAKE_TIN_STATS_ ## attr]))
+
+	open_json_object(NULL);
+	PRINT_TSTAT_JSON(u32, "threshold_rate", THRESHOLD_RATE);
+	PRINT_TSTAT_JSON(u32, "target_us", TARGET_US);
+	PRINT_TSTAT_JSON(u32, "interval_us", INTERVAL_US);
+	PRINT_TSTAT_JSON(u32, "peak_delay_us", PEAK_DELAY_US);
+	PRINT_TSTAT_JSON(u32, "avg_delay_us", AVG_DELAY_US);
+	PRINT_TSTAT_JSON(u32, "base_delay_us", BASE_DELAY_US);
+	PRINT_TSTAT_JSON(u32, "sent_packets", SENT_PACKETS);
+	PRINT_TSTAT_JSON(u64, "sent_bytes", SENT_BYTES64);
+	PRINT_TSTAT_JSON(u32, "way_indirect_hits", WAY_INDIRECT_HITS);
+	PRINT_TSTAT_JSON(u32, "way_misses", WAY_MISSES);
+	PRINT_TSTAT_JSON(u32, "way_collisions", WAY_COLLISIONS);
+	PRINT_TSTAT_JSON(u32, "drops", DROPPED_PACKETS);
+	PRINT_TSTAT_JSON(u32, "ecn_mark", ECN_MARKED_PACKETS);
+	PRINT_TSTAT_JSON(u32, "ack_drops", ACKS_DROPPED_PACKETS);
+	PRINT_TSTAT_JSON(u32, "sparse_flows", SPARSE_FLOWS);
+	PRINT_TSTAT_JSON(u32, "bulk_flows", BULK_FLOWS);
+	PRINT_TSTAT_JSON(u32, "unresponsive_flows", UNRESPONSIVE_FLOWS);
+	PRINT_TSTAT_JSON(u32, "max_pkt_len", MAX_SKBLEN);
+	PRINT_TSTAT_JSON(u32, "flow_quantum", FLOW_QUANTUM);
+	close_json_object();
+
+#undef PRINT_TSTAT_JSON
+}
+
+static int cake_print_xstats(struct qdisc_util *qu, FILE *f,
+			     struct rtattr *xstats)
+{
+	SPRINT_BUF(b1);
+	struct rtattr *st[TCA_CAKE_STATS_MAX + 1];
+	int i;
+
+	if (xstats == NULL)
+		return 0;
+
+#define GET_STAT_U32(attr) rta_getattr_u32(st[TCA_CAKE_STATS_ ## attr])
+
+	parse_rtattr_nested(st, TCA_CAKE_STATS_MAX, xstats);
+
+	if (st[TCA_CAKE_STATS_MEMORY_USED] &&
+	    st[TCA_CAKE_STATS_MEMORY_LIMIT]) {
+		print_string(PRINT_FP, NULL, " memory used: %s",
+			sprint_size(GET_STAT_U32(MEMORY_USED), b1));
+
+		print_string(PRINT_FP, NULL, " of %s\n",
+			sprint_size(GET_STAT_U32(MEMORY_LIMIT), b1));
+
+		print_uint(PRINT_JSON, "memory_used", NULL,
+			GET_STAT_U32(MEMORY_USED));
+		print_uint(PRINT_JSON, "memory_limit", NULL,
+			GET_STAT_U32(MEMORY_LIMIT));
+	}
+
+	if (st[TCA_CAKE_STATS_CAPACITY_ESTIMATE]) {
+		print_string(PRINT_FP, NULL, " capacity estimate: %s\n",
+			sprint_rate(GET_STAT_U32(CAPACITY_ESTIMATE), b1));
+		print_uint(PRINT_JSON, "capacity_estimate", NULL,
+			GET_STAT_U32(CAPACITY_ESTIMATE));
+	}
+
+	if (st[TCA_CAKE_STATS_MIN_NETLEN] &&
+	    st[TCA_CAKE_STATS_MAX_NETLEN]) {
+		print_uint(PRINT_ANY, "min_network_size",
+			   " min/max network layer size: %12u",
+			   GET_STAT_U32(MIN_NETLEN));
+		print_uint(PRINT_ANY, "max_network_size",
+			   " /%8u\n", GET_STAT_U32(MAX_NETLEN));
+	}
+
+	if (st[TCA_CAKE_STATS_MIN_ADJLEN] &&
+	    st[TCA_CAKE_STATS_MAX_ADJLEN]) {
+		print_uint(PRINT_ANY, "min_adj_size",
+			   " min/max overhead-adjusted size: %8u",
+			   GET_STAT_U32(MIN_ADJLEN));
+		print_uint(PRINT_ANY, "max_adj_size",
+			   " /%8u\n", GET_STAT_U32(MAX_ADJLEN));
+	}
+
+	if (st[TCA_CAKE_STATS_AVG_NETOFF])
+		print_uint(PRINT_ANY, "avg_hdr_offset",
+			   " average network hdr offset: %12u\n\n",
+			   GET_STAT_U32(AVG_NETOFF));
+
+#undef GET_STAT_U32
+
+	if (st[TCA_CAKE_STATS_TIN_STATS]) {
+		struct rtattr *tins[TC_CAKE_MAX_TINS + 1];
+		struct rtattr *tstat[TC_CAKE_MAX_TINS][TCA_CAKE_TIN_STATS_MAX + 1];
+		int num_tins = 0;
+
+		parse_rtattr_nested(tins, TC_CAKE_MAX_TINS, st[TCA_CAKE_STATS_TIN_STATS]);
+
+		for (i = 1; i <= TC_CAKE_MAX_TINS && tins[i]; i++) {
+			parse_rtattr_nested(tstat[i-1], TCA_CAKE_TIN_STATS_MAX, tins[i]);
+			num_tins++;
+		}
+
+		if (!num_tins)
+			return 0;
+
+		if (is_json_context()) {
+			open_json_array(PRINT_JSON, "tins");
+			for (i = 0; i < num_tins; i++)
+				cake_print_json_tin(tstat[i]);
+			close_json_array(PRINT_JSON, NULL);
+
+			return 0;
+		}
+
+
+		switch(num_tins) {
+		case 3:
+			fprintf(f, "                   Bulk  Best Effort        Voice\n");
+			break;
+
+		case 4:
+			fprintf(f, "                   Bulk  Best Effort        Video        Voice\n");
+			break;
+
+		default:
+			fprintf(f, "          ");
+			for(i=0; i < num_tins; i++)
+				fprintf(f, "        Tin %u", i);
+			fprintf(f, "\n");
+		};
+
+#define GET_TSTAT(i, attr) (tstat[i][TCA_CAKE_TIN_STATS_ ## attr])
+#define PRINT_TSTAT(name, attr, fmts, val)	do {		\
+			if (GET_TSTAT(0, attr)) {		\
+				fprintf(f, name);		\
+				for (i = 0; i < num_tins; i++)	\
+					fprintf(f, " %12" fmts,	val);	\
+				fprintf(f, "\n");			\
+			}						\
+		} while (0)
+
+#define SPRINT_TSTAT(pfunc, name, attr) PRINT_TSTAT(		\
+			name, attr, "s", sprint_ ## pfunc(		\
+				rta_getattr_u32(GET_TSTAT(i, attr)), b1))
+
+#define PRINT_TSTAT_U32(name, attr)	PRINT_TSTAT(			\
+			name, attr, "u", rta_getattr_u32(GET_TSTAT(i, attr)))
+
+#define PRINT_TSTAT_U64(name, attr)	PRINT_TSTAT(			\
+			name, attr, "llu", rta_getattr_u64(GET_TSTAT(i, attr)))
+
+		SPRINT_TSTAT(rate, "  thresh  ", THRESHOLD_RATE);
+		SPRINT_TSTAT(time, "  target  ", TARGET_US);
+		SPRINT_TSTAT(time, "  interval", INTERVAL_US);
+		SPRINT_TSTAT(time, "  pk_delay", PEAK_DELAY_US);
+		SPRINT_TSTAT(time, "  av_delay", AVG_DELAY_US);
+		SPRINT_TSTAT(time, "  sp_delay", BASE_DELAY_US);
+
+		PRINT_TSTAT_U32("  pkts    ", SENT_PACKETS);
+		PRINT_TSTAT_U64("  bytes   ", SENT_BYTES64);
+
+		PRINT_TSTAT_U32("  way_inds", WAY_INDIRECT_HITS);
+		PRINT_TSTAT_U32("  way_miss", WAY_MISSES);
+		PRINT_TSTAT_U32("  way_cols", WAY_COLLISIONS);
+		PRINT_TSTAT_U32("  drops   ", DROPPED_PACKETS);
+		PRINT_TSTAT_U32("  marks   ", ECN_MARKED_PACKETS);
+		PRINT_TSTAT_U32("  ack_drop", ACKS_DROPPED_PACKETS);
+		PRINT_TSTAT_U32("  sp_flows", SPARSE_FLOWS);
+		PRINT_TSTAT_U32("  bk_flows", BULK_FLOWS);
+		PRINT_TSTAT_U32("  un_flows", UNRESPONSIVE_FLOWS);
+		PRINT_TSTAT_U32("  max_len ", MAX_SKBLEN);
+		PRINT_TSTAT_U32("  quantum ", FLOW_QUANTUM);
+
+#undef GET_STAT
+#undef PRINT_TSTAT
+#undef SPRINT_TSTAT
+#undef PRINT_TSTAT_U32
+#undef PRINT_TSTAT_U64
+	}
+	return 0;
+}
+
+struct qdisc_util cake_qdisc_util = {
+	.id		= "cake",
+	.parse_qopt	= cake_parse_opt,
+	.print_qopt	= cake_print_opt,
+	.print_xstats	= cake_print_xstats,
+};
-- 
2.17.0

^ permalink raw reply related

* [PATCH bpf-next v2 00/15] Introducing AF_XDP support
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang

From: Björn Töpel <bjorn.topel@intel.com>

This patch set introduces a new address family called AF_XDP that is
optimized for high performance packet processing and, in upcoming
patch sets, zero-copy semantics. In this v2 version, we have removed
all zero-copy related code in order to make it smaller, simpler and
hopefully more review friendly. This patch set only supports copy-mode
for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
for RX using the XDP_DRV path. Zero-copy support requires XDP and
driver changes that Jesper Dangaard Brouer is working on. Some of his
work has already been accepted. We will publish our zero-copy support
for RX and TX on top of his patch sets at a later point in time.

An AF_XDP socket (XSK) is created with the normal socket()
syscall. Associated with each XSK are two queues: the RX queue and the
TX queue. A socket can receive packets on the RX queue and it can send
packets on the TX queue. These queues are registered and sized with
the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
mandatory to have at least one of these queues for each socket. In
contrast to AF_PACKET V2/V3 these descriptor queues are separated from
packet buffers. An RX or TX descriptor points to a data buffer in a
memory area called a UMEM. RX and TX can share the same UMEM so that a
packet does not have to be copied between RX and TX. Moreover, if a
packet needs to be kept for a while due to a possible retransmit, the
descriptor that points to that packet can be changed to point to
another and reused right away. This again avoids copying data.

This new dedicated packet buffer area is call a UMEM. It consists of a
number of equally size frames and each frame has a unique frame id. A
descriptor in one of the queues references a frame by referencing its
frame id. The user space allocates memory for this UMEM using whatever
means it feels is most appropriate (malloc, mmap, huge pages,
etc). This memory area is then registered with the kernel using the new
setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
and the COMPLETION queue. The fill queue is used by the application to
send down frame ids for the kernel to fill in with RX packet
data. References to these frames will then appear in the RX queue of
the XSK once they have been received. The completion queue, on the
other hand, contains frame ids that the kernel has transmitted
completely and can now be used again by user space, for either TX or
RX. Thus, the frame ids appearing in the completion queue are ids that
were previously transmitted using the TX queue. In summary, the RX and
FILL queues are used for the RX path and the TX and COMPLETION queues
are used for the TX path.

The socket is then finally bound with a bind() call to a device and a
specific queue id on that device, and it is not until bind is
completed that traffic starts to flow. Note that in this patch set,
all packet data is copied out to user-space.

A new feature in this patch set is that the UMEM can be shared between
processes, if desired. If a process wants to do this, it simply skips
the registration of the UMEM and its corresponding two queues, sets a
flag in the bind call and submits the XSK of the process it would like
to share UMEM with as well as its own newly created XSK socket. The
new process will then receive frame id references in its own RX queue
that point to this shared UMEM. Note that since the queue structures
are single-consumer / single-producer (for performance reasons), the
new process has to create its own socket with associated RX and TX
queues, since it cannot share this with the other process. This is
also the reason that there is only one set of FILL and COMPLETION
queues per UMEM. It is the responsibility of a single process to
handle the UMEM. If multiple-producer / multiple-consumer queues are
implemented in the future, this requirement could be relaxed.

How is then packets distributed between these two XSK? We have
introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
full). The user-space application can place an XSK at an arbitrary
place in this map. The XDP program can then redirect a packet to a
specific index in this map and at this point XDP validates that the
XSK in that map was indeed bound to that device and queue number. If
not, the packet is dropped. If the map is empty at that index, the
packet is also dropped. This also means that it is currently mandatory
to have an XDP program loaded (and one XSK in the XSKMAP) to be able
to get any traffic to user space through the XSK.

AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
driver does not have support for XDP, or XDP_SKB is explicitly chosen
when loading the XDP program, XDP_SKB mode is employed that uses SKBs
together with the generic XDP support and copies out the data to user
space. A fallback mode that works for any network device. On the other
hand, if the driver has support for XDP, it will be used by the AF_XDP
code to provide better performance, but there is still a copy of the
data into user space.

There is a xdpsock benchmarking/test application included that
demonstrates how to use AF_XDP sockets with both private and shared
UMEMs. Say that you would like your UDP traffic from port 4242 to end
up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
for this:

      ethtool -N p3p2 rx-flow-hash udp4 fn
      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
          action 16

Running the rxdrop benchmark in XDP_DRV mode can then be done
using:

      samples/bpf/xdpsock -i p3p2 -q 16 -r -N

For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
can be displayed with "-h", as usual.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments. One for TR/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
Intel I40E 40Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by commercial packet generator HW that is
generating packets at full 40 Gbit/s line rate.

AF_XDP performance 64 byte packets. Results from V1 in parenthesis.
Benchmark   XDP_SKB   XDP_DRV
rxdrop       3.0(2.9)   9.5(9.4)  
txpush       2.5(2.5)   NA*
l2fwd        1.9(1.9)   2.5(2.4) (TX using XDP_SKB in both cases)

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB   XDP_DRV
rxdrop       2.2(2.1)   3.3(3.3)  
l2fwd        1.4(1.4)   1.8(1.8) (TX using XDP_SKB in both cases)

* NA since we have no support for TX using the XDP_DRV infrastructure
  in this patch set. This is for a future patch set since it involves
  changes to the XDP NDOs. Some of this has been upstreamed by Jesper
  Dangaard Brouer.

XDP performance on our system as a base line:

64 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      32,921,521  0

1500 byte packets:
XDP stats       CPU     pps         issue-pps
XDP-RX CPU      16      3,289,491   0

Changes from V1:

* Fixes to bugs spotted by Will in his review
* Implemented the performance otimization to BPF_MAP_TYPE_XSKMAP 
  suggested by Will
* Refactored packet_direct_xmit to become a common function
  in core/dev.c as suggested by Will
* Added documentation as suggested by Jesper
* Proper page unpinning as suggested by MST
* Some minor code cleanups

The structure of the patch set is as follows:

Patches 1-3: Basic socket and umem plumbing 
Patches 4-9: RX support together with the new XSKMAP
Patches 10-13: TX support
Patch 14: Statistics support with getsockopt()
Patch 15: Sample application

We based this patch set on bpf-next commit 

To do for this patch set:

* Syzkaller torture session being worked on

Post-series plan:

* Optimize performance

* Kernel selftest

* Kernel load module support of AF_XDP would be nice. Unclear how to
  achieve this though since our XDP code depends on net/core.

* Support for AF_XDP sockets without an XPD program loaded. In this
  case all the traffic on a queue should go up to the user space socket.

* Daniel Borkmann's suggestion for a "copy to XDP socket, and return
  XDP_PASS" for a tcpdump-like functionality.

* And of course getting to zero-copy support in small increments. 

Thanks: Björn and Magnus

Björn Töpel (7):
  net: initial AF_XDP skeleton
  xsk: add user memory registration support sockopt
  xsk: add Rx queue setup and mmap support
  xsk: add Rx receive functions and poll support
  bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
  xsk: wire up XDP_DRV side of AF_XDP
  xsk: wire up XDP_SKB side of AF_XDP

Magnus Karlsson (8):
  xsk: add umem fill queue support and mmap
  xsk: add support for bind for Rx
  xsk: add umem completion queue support and mmap
  xsk: add Tx queue setup and mmap support
  dev: packet: make packet_direct_xmit a common function
  xsk: support for Tx
  xsk: statistics support
  samples/bpf: sample application and documentation for AF_XDP sockets

 Documentation/networking/af_xdp.rst | 297 +++++++++++
 Documentation/networking/index.rst  |   1 +
 MAINTAINERS                         |   8 +
 include/linux/bpf.h                 |  26 +
 include/linux/bpf_types.h           |   3 +
 include/linux/filter.h              |   2 +-
 include/linux/netdevice.h           |   1 +
 include/linux/socket.h              |   5 +-
 include/net/xdp.h                   |   1 +
 include/net/xdp_sock.h              |  66 +++
 include/uapi/linux/bpf.h            |   1 +
 include/uapi/linux/if_xdp.h         |  87 ++++
 kernel/bpf/Makefile                 |   3 +
 kernel/bpf/verifier.c               |   8 +-
 kernel/bpf/xskmap.c                 | 272 +++++++++++
 net/Kconfig                         |   1 +
 net/Makefile                        |   1 +
 net/core/dev.c                      |  73 ++-
 net/core/filter.c                   |  40 +-
 net/core/sock.c                     |  12 +-
 net/core/xdp.c                      |  15 +-
 net/packet/af_packet.c              |  42 +-
 net/xdp/Kconfig                     |   7 +
 net/xdp/Makefile                    |   2 +
 net/xdp/xdp_umem.c                  | 260 ++++++++++
 net/xdp/xdp_umem.h                  |  67 +++
 net/xdp/xdp_umem_props.h            |  23 +
 net/xdp/xsk.c                       | 656 +++++++++++++++++++++++++
 net/xdp/xsk_queue.c                 |  73 +++
 net/xdp/xsk_queue.h                 | 247 ++++++++++
 samples/bpf/Makefile                |   4 +
 samples/bpf/xdpsock.h               |  11 +
 samples/bpf/xdpsock_kern.c          |  56 +++
 samples/bpf/xdpsock_user.c          | 948 ++++++++++++++++++++++++++++++++++++
 security/selinux/hooks.c            |   4 +-
 security/selinux/include/classmap.h |   4 +-
 36 files changed, 3255 insertions(+), 72 deletions(-)
 create mode 100644 Documentation/networking/af_xdp.rst
 create mode 100644 include/net/xdp_sock.h
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 kernel/bpf/xskmap.c
 create mode 100644 net/xdp/Kconfig
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xdp_umem.c
 create mode 100644 net/xdp/xdp_umem.h
 create mode 100644 net/xdp/xdp_umem_props.h
 create mode 100644 net/xdp/xsk.c
 create mode 100644 net/xdp/xsk_queue.c
 create mode 100644 net/xdp/xsk_queue.h
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c

-- 
2.14.1

^ permalink raw reply

* [PATCH bpf-next v2 01/15] net: initial AF_XDP skeleton
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Buildable skeleton of AF_XDP without any functionality. Just what it
takes to register a new address family.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 MAINTAINERS                         |  8 ++++++++
 include/linux/socket.h              |  5 ++++-
 net/Kconfig                         |  1 +
 net/core/sock.c                     | 12 ++++++++----
 net/xdp/Kconfig                     |  7 +++++++
 security/selinux/hooks.c            |  4 +++-
 security/selinux/include/classmap.h |  4 +++-
 7 files changed, 34 insertions(+), 7 deletions(-)
 create mode 100644 net/xdp/Kconfig

diff --git a/MAINTAINERS b/MAINTAINERS
index a52800867850..79e155976e19 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15425,6 +15425,14 @@ T:	git git://linuxtv.org/media_tree.git
 S:	Maintained
 F:	drivers/media/tuners/tuner-xc2028.*
 
+XDP SOCKETS (AF_XDP)
+M:	Björn Töpel <bjorn.topel@intel.com>
+M:	Magnus Karlsson <magnus.karlsson@intel.com>
+L:	netdev@vger.kernel.org
+S:	Maintained
+F:	kernel/bpf/xskmap.c
+F:	net/xdp/
+
 XEN BLOCK SUBSYSTEM
 M:	Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
 M:	Roger Pau Monné <roger.pau@citrix.com>
diff --git a/include/linux/socket.h b/include/linux/socket.h
index ea50f4a65816..7ed4713d5337 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -207,8 +207,9 @@ struct ucred {
 				 * PF_SMC protocol family that
 				 * reuses AF_INET address family
 				 */
+#define AF_XDP		44	/* XDP sockets			*/
 
-#define AF_MAX		44	/* For now.. */
+#define AF_MAX		45	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -257,6 +258,7 @@ struct ucred {
 #define PF_KCM		AF_KCM
 #define PF_QIPCRTR	AF_QIPCRTR
 #define PF_SMC		AF_SMC
+#define PF_XDP		AF_XDP
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
@@ -338,6 +340,7 @@ struct ucred {
 #define SOL_NFC		280
 #define SOL_KCM		281
 #define SOL_TLS		282
+#define SOL_XDP		283
 
 /* IPX options */
 #define IPX_TYPE	1
diff --git a/net/Kconfig b/net/Kconfig
index 6fa1a4493b8c..86471a1c1ed4 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -59,6 +59,7 @@ source "net/tls/Kconfig"
 source "net/xfrm/Kconfig"
 source "net/iucv/Kconfig"
 source "net/smc/Kconfig"
+source "net/xdp/Kconfig"
 
 config INET
 	bool "TCP/IP networking"
diff --git a/net/core/sock.c b/net/core/sock.c
index b2c3db169ca1..e7d8b6c955c6 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -226,7 +226,8 @@ static struct lock_class_key af_family_kern_slock_keys[AF_MAX];
   x "AF_RXRPC" ,	x "AF_ISDN"     ,	x "AF_PHONET"   , \
   x "AF_IEEE802154",	x "AF_CAIF"	,	x "AF_ALG"      , \
   x "AF_NFC"   ,	x "AF_VSOCK"    ,	x "AF_KCM"      , \
-  x "AF_QIPCRTR",	x "AF_SMC"	,	x "AF_MAX"
+  x "AF_QIPCRTR",	x "AF_SMC"	,	x "AF_XDP"	, \
+  x "AF_MAX"
 
 static const char *const af_family_key_strings[AF_MAX+1] = {
 	_sock_locks("sk_lock-")
@@ -262,7 +263,8 @@ static const char *const af_family_rlock_key_strings[AF_MAX+1] = {
   "rlock-AF_RXRPC" , "rlock-AF_ISDN"     , "rlock-AF_PHONET"   ,
   "rlock-AF_IEEE802154", "rlock-AF_CAIF" , "rlock-AF_ALG"      ,
   "rlock-AF_NFC"   , "rlock-AF_VSOCK"    , "rlock-AF_KCM"      ,
-  "rlock-AF_QIPCRTR", "rlock-AF_SMC"     , "rlock-AF_MAX"
+  "rlock-AF_QIPCRTR", "rlock-AF_SMC"     , "rlock-AF_XDP"      ,
+  "rlock-AF_MAX"
 };
 static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
   "wlock-AF_UNSPEC", "wlock-AF_UNIX"     , "wlock-AF_INET"     ,
@@ -279,7 +281,8 @@ static const char *const af_family_wlock_key_strings[AF_MAX+1] = {
   "wlock-AF_RXRPC" , "wlock-AF_ISDN"     , "wlock-AF_PHONET"   ,
   "wlock-AF_IEEE802154", "wlock-AF_CAIF" , "wlock-AF_ALG"      ,
   "wlock-AF_NFC"   , "wlock-AF_VSOCK"    , "wlock-AF_KCM"      ,
-  "wlock-AF_QIPCRTR", "wlock-AF_SMC"     , "wlock-AF_MAX"
+  "wlock-AF_QIPCRTR", "wlock-AF_SMC"     , "wlock-AF_XDP"      ,
+  "wlock-AF_MAX"
 };
 static const char *const af_family_elock_key_strings[AF_MAX+1] = {
   "elock-AF_UNSPEC", "elock-AF_UNIX"     , "elock-AF_INET"     ,
@@ -296,7 +299,8 @@ static const char *const af_family_elock_key_strings[AF_MAX+1] = {
   "elock-AF_RXRPC" , "elock-AF_ISDN"     , "elock-AF_PHONET"   ,
   "elock-AF_IEEE802154", "elock-AF_CAIF" , "elock-AF_ALG"      ,
   "elock-AF_NFC"   , "elock-AF_VSOCK"    , "elock-AF_KCM"      ,
-  "elock-AF_QIPCRTR", "elock-AF_SMC"     , "elock-AF_MAX"
+  "elock-AF_QIPCRTR", "elock-AF_SMC"     , "elock-AF_XDP"      ,
+  "elock-AF_MAX"
 };
 
 /*
diff --git a/net/xdp/Kconfig b/net/xdp/Kconfig
new file mode 100644
index 000000000000..90e4a7152854
--- /dev/null
+++ b/net/xdp/Kconfig
@@ -0,0 +1,7 @@
+config XDP_SOCKETS
+	bool "XDP sockets"
+	depends on BPF_SYSCALL
+	default n
+	help
+	  XDP sockets allows a channel between XDP programs and
+	  userspace applications.
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 4cafe6a19167..5c508d26b367 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -1471,7 +1471,9 @@ static inline u16 socket_type_to_security_class(int family, int type, int protoc
 			return SECCLASS_QIPCRTR_SOCKET;
 		case PF_SMC:
 			return SECCLASS_SMC_SOCKET;
-#if PF_MAX > 44
+		case PF_XDP:
+			return SECCLASS_XDP_SOCKET;
+#if PF_MAX > 45
 #error New address family defined, please update this function.
 #endif
 		}
diff --git a/security/selinux/include/classmap.h b/security/selinux/include/classmap.h
index 7f0372426494..bd5fe0d3204a 100644
--- a/security/selinux/include/classmap.h
+++ b/security/selinux/include/classmap.h
@@ -240,9 +240,11 @@ struct security_class_mapping secclass_map[] = {
 	  { "manage_subnet", NULL } },
 	{ "bpf",
 	  {"map_create", "map_read", "map_write", "prog_load", "prog_run"} },
+	{ "xdp_socket",
+	  { COMMON_SOCK_PERMS, NULL } },
 	{ NULL }
   };
 
-#if PF_MAX > 44
+#if PF_MAX > 45
 #error New address family defined, please update secclass_map.
 #endif
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 02/15] xsk: add user memory registration support sockopt
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

In this commit the base structure of the AF_XDP address family is set
up. Further, we introduce the abilty register a window of user memory
to the kernel via the XDP_UMEM_REG setsockopt syscall. The memory
window is viewed by an AF_XDP socket as a set of equally large
frames. After a user memory registration all frames are "owned" by the
user application, and not the kernel.

v2: More robust checks on umem creation and unaccount on error.
    Call set_page_dirty_lock on cleanup.
    Simplified xdp_umem_reg.

Co-authored-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h      |  31 ++++++
 include/uapi/linux/if_xdp.h |  34 ++++++
 net/Makefile                |   1 +
 net/xdp/Makefile            |   2 +
 net/xdp/xdp_umem.c          | 245 ++++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xdp_umem.h          |  45 ++++++++
 net/xdp/xdp_umem_props.h    |  23 +++++
 net/xdp/xsk.c               | 215 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 596 insertions(+)
 create mode 100644 include/net/xdp_sock.h
 create mode 100644 include/uapi/linux/if_xdp.h
 create mode 100644 net/xdp/Makefile
 create mode 100644 net/xdp/xdp_umem.c
 create mode 100644 net/xdp/xdp_umem.h
 create mode 100644 net/xdp/xdp_umem_props.h
 create mode 100644 net/xdp/xsk.c

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
new file mode 100644
index 000000000000..94785f5db13e
--- /dev/null
+++ b/include/net/xdp_sock.h
@@ -0,0 +1,31 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * AF_XDP internal functions
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XDP_SOCK_H
+#define _LINUX_XDP_SOCK_H
+
+#include <linux/mutex.h>
+#include <net/sock.h>
+
+struct xdp_umem;
+
+struct xdp_sock {
+	/* struct sock must be the first member of struct xdp_sock */
+	struct sock sk;
+	struct xdp_umem *umem;
+	/* Protects multiple processes in the control path */
+	struct mutex mutex;
+};
+
+#endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
new file mode 100644
index 000000000000..41252135a0fe
--- /dev/null
+++ b/include/uapi/linux/if_xdp.h
@@ -0,0 +1,34 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+ *
+ * if_xdp: XDP socket user-space interface
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Author(s): Björn Töpel <bjorn.topel@intel.com>
+ *	      Magnus Karlsson <magnus.karlsson@intel.com>
+ */
+
+#ifndef _LINUX_IF_XDP_H
+#define _LINUX_IF_XDP_H
+
+#include <linux/types.h>
+
+/* XDP socket options */
+#define XDP_UMEM_REG			3
+
+struct xdp_umem_reg {
+	__u64 addr; /* Start of packet data area */
+	__u64 len; /* Length of packet data area */
+	__u32 frame_size; /* Frame size */
+	__u32 frame_headroom; /* Frame head room */
+};
+
+#endif /* _LINUX_IF_XDP_H */
diff --git a/net/Makefile b/net/Makefile
index a6147c61b174..77aaddedbd29 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -85,3 +85,4 @@ obj-y				+= l3mdev/
 endif
 obj-$(CONFIG_QRTR)		+= qrtr/
 obj-$(CONFIG_NET_NCSI)		+= ncsi/
+obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
new file mode 100644
index 000000000000..a5d736640a0f
--- /dev/null
+++ b/net/xdp/Makefile
@@ -0,0 +1,2 @@
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
+
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
new file mode 100644
index 000000000000..ec8b3552be44
--- /dev/null
+++ b/net/xdp/xdp_umem.c
@@ -0,0 +1,245 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/init.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/uaccess.h>
+#include <linux/slab.h>
+#include <linux/bpf.h>
+#include <linux/mm.h>
+
+#include "xdp_umem.h"
+
+#define XDP_UMEM_MIN_FRAME_SIZE 2048
+
+int xdp_umem_create(struct xdp_umem **umem)
+{
+	*umem = kzalloc(sizeof(**umem), GFP_KERNEL);
+
+	if (!(*umem))
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void xdp_umem_unpin_pages(struct xdp_umem *umem)
+{
+	unsigned int i;
+
+	if (umem->pgs) {
+		for (i = 0; i < umem->npgs; i++) {
+			struct page *page = umem->pgs[i];
+
+			set_page_dirty_lock(page);
+			put_page(page);
+		}
+
+		kfree(umem->pgs);
+		umem->pgs = NULL;
+	}
+}
+
+static void xdp_umem_unaccount_pages(struct xdp_umem *umem)
+{
+	if (umem->user) {
+		atomic_long_sub(umem->npgs, &umem->user->locked_vm);
+		free_uid(umem->user);
+	}
+}
+
+static void xdp_umem_release(struct xdp_umem *umem)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+
+	if (umem->pgs) {
+		xdp_umem_unpin_pages(umem);
+
+		task = get_pid_task(umem->pid, PIDTYPE_PID);
+		put_pid(umem->pid);
+		if (!task)
+			goto out;
+		mm = get_task_mm(task);
+		put_task_struct(task);
+		if (!mm)
+			goto out;
+
+		mmput(mm);
+		umem->pgs = NULL;
+	}
+
+	xdp_umem_unaccount_pages(umem);
+out:
+	kfree(umem);
+}
+
+static void xdp_umem_release_deferred(struct work_struct *work)
+{
+	struct xdp_umem *umem = container_of(work, struct xdp_umem, work);
+
+	xdp_umem_release(umem);
+}
+
+void xdp_get_umem(struct xdp_umem *umem)
+{
+	atomic_inc(&umem->users);
+}
+
+void xdp_put_umem(struct xdp_umem *umem)
+{
+	if (!umem)
+		return;
+
+	if (atomic_dec_and_test(&umem->users)) {
+		INIT_WORK(&umem->work, xdp_umem_release_deferred);
+		schedule_work(&umem->work);
+	}
+}
+
+static int xdp_umem_pin_pages(struct xdp_umem *umem)
+{
+	unsigned int gup_flags = FOLL_WRITE;
+	long npgs;
+	int err;
+
+	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_KERNEL);
+	if (!umem->pgs)
+		return -ENOMEM;
+
+	down_write(&current->mm->mmap_sem);
+	npgs = get_user_pages(umem->address, umem->npgs,
+			      gup_flags, &umem->pgs[0], NULL);
+	up_write(&current->mm->mmap_sem);
+
+	if (npgs != umem->npgs) {
+		if (npgs >= 0) {
+			umem->npgs = npgs;
+			err = -ENOMEM;
+			goto out_pin;
+		}
+		err = npgs;
+		goto out_pgs;
+	}
+	return 0;
+
+out_pin:
+	xdp_umem_unpin_pages(umem);
+out_pgs:
+	kfree(umem->pgs);
+	umem->pgs = NULL;
+	return err;
+}
+
+static int xdp_umem_account_pages(struct xdp_umem *umem)
+{
+	unsigned long lock_limit, new_npgs, old_npgs;
+
+	if (capable(CAP_IPC_LOCK))
+		return 0;
+
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+	umem->user = get_uid(current_user());
+
+	do {
+		old_npgs = atomic_long_read(&umem->user->locked_vm);
+		new_npgs = old_npgs + umem->npgs;
+		if (new_npgs > lock_limit) {
+			free_uid(umem->user);
+			umem->user = NULL;
+			return -ENOBUFS;
+		}
+	} while (atomic_long_cmpxchg(&umem->user->locked_vm, old_npgs,
+				     new_npgs) != old_npgs);
+	return 0;
+}
+
+int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
+{
+	u32 frame_size = mr->frame_size, frame_headroom = mr->frame_headroom;
+	u64 addr = mr->addr, size = mr->len;
+	unsigned int nframes, nfpp;
+	int size_chk, err;
+
+	if (!umem)
+		return -EINVAL;
+
+	if (frame_size < XDP_UMEM_MIN_FRAME_SIZE || frame_size > PAGE_SIZE) {
+		/* Strictly speaking we could support this, if:
+		 * - huge pages, or*
+		 * - using an IOMMU, or
+		 * - making sure the memory area is consecutive
+		 * but for now, we simply say "computer says no".
+		 */
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(frame_size))
+		return -EINVAL;
+
+	if (!PAGE_ALIGNED(addr)) {
+		/* Memory area has to be page size aligned. For
+		 * simplicity, this might change.
+		 */
+		return -EINVAL;
+	}
+
+	if ((addr + size) < addr)
+		return -EINVAL;
+
+	nframes = size / frame_size;
+	if (nframes == 0 || nframes > UINT_MAX)
+		return -EINVAL;
+
+	nfpp = PAGE_SIZE / frame_size;
+	if (nframes < nfpp || nframes % nfpp)
+		return -EINVAL;
+
+	frame_headroom = ALIGN(frame_headroom, 64);
+
+	size_chk = frame_size - frame_headroom - XDP_PACKET_HEADROOM;
+	if (size_chk < 0)
+		return -EINVAL;
+
+	umem->pid = get_task_pid(current, PIDTYPE_PID);
+	umem->size = (size_t)size;
+	umem->address = (unsigned long)addr;
+	umem->props.frame_size = frame_size;
+	umem->props.nframes = nframes;
+	umem->frame_headroom = frame_headroom;
+	umem->npgs = size / PAGE_SIZE;
+	umem->pgs = NULL;
+	umem->user = NULL;
+
+	umem->frame_size_log2 = ilog2(frame_size);
+	umem->nfpp_mask = nfpp - 1;
+	umem->nfpplog2 = ilog2(nfpp);
+	atomic_set(&umem->users, 1);
+
+	err = xdp_umem_account_pages(umem);
+	if (err)
+		goto out;
+
+	err = xdp_umem_pin_pages(umem);
+	if (err)
+		goto out_account;
+	return 0;
+
+out_account:
+	xdp_umem_unaccount_pages(umem);
+out:
+	put_pid(umem->pid);
+	return err;
+}
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
new file mode 100644
index 000000000000..4597ae81a221
--- /dev/null
+++ b/net/xdp/xdp_umem.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef XDP_UMEM_H_
+#define XDP_UMEM_H_
+
+#include <linux/mm.h>
+#include <linux/if_xdp.h>
+#include <linux/workqueue.h>
+
+#include "xdp_umem_props.h"
+
+struct xdp_umem {
+	struct page **pgs;
+	struct xdp_umem_props props;
+	u32 npgs;
+	u32 frame_headroom;
+	u32 nfpp_mask;
+	u32 nfpplog2;
+	u32 frame_size_log2;
+	struct user_struct *user;
+	struct pid *pid;
+	unsigned long address;
+	size_t size;
+	atomic_t users;
+	struct work_struct work;
+};
+
+int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
+void xdp_get_umem(struct xdp_umem *umem);
+void xdp_put_umem(struct xdp_umem *umem);
+int xdp_umem_create(struct xdp_umem **umem);
+
+#endif /* XDP_UMEM_H_ */
diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h
new file mode 100644
index 000000000000..77fb5daf29f3
--- /dev/null
+++ b/net/xdp/xdp_umem_props.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * XDP user-space packet buffer
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef XDP_UMEM_PROPS_H_
+#define XDP_UMEM_PROPS_H_
+
+struct xdp_umem_props {
+	u32 frame_size;
+	u32 nframes;
+};
+
+#endif /* XDP_UMEM_PROPS_H_ */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
new file mode 100644
index 000000000000..84e0e867febb
--- /dev/null
+++ b/net/xdp/xsk.c
@@ -0,0 +1,215 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XDP sockets
+ *
+ * AF_XDP sockets allows a channel between XDP programs and userspace
+ * applications.
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Author(s): Björn Töpel <bjorn.topel@intel.com>
+ *	      Magnus Karlsson <magnus.karlsson@intel.com>
+ */
+
+#define pr_fmt(fmt) "AF_XDP: %s: " fmt, __func__
+
+#include <linux/if_xdp.h>
+#include <linux/init.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/task.h>
+#include <linux/socket.h>
+#include <linux/file.h>
+#include <linux/uaccess.h>
+#include <linux/net.h>
+#include <linux/netdevice.h>
+#include <net/xdp_sock.h>
+
+#include "xdp_umem.h"
+
+static struct xdp_sock *xdp_sk(struct sock *sk)
+{
+	return (struct xdp_sock *)sk;
+}
+
+static int xsk_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct net *net;
+
+	if (!sk)
+		return 0;
+
+	net = sock_net(sk);
+
+	local_bh_disable();
+	sock_prot_inuse_add(net, sk->sk_prot, -1);
+	local_bh_enable();
+
+	sock_orphan(sk);
+	sock->sk = NULL;
+
+	sk_refcnt_debug_release(sk);
+	sock_put(sk);
+
+	return 0;
+}
+
+static int xsk_setsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, unsigned int optlen)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	int err;
+
+	if (level != SOL_XDP)
+		return -ENOPROTOOPT;
+
+	switch (optname) {
+	case XDP_UMEM_REG:
+	{
+		struct xdp_umem_reg mr;
+		struct xdp_umem *umem;
+
+		if (xs->umem)
+			return -EBUSY;
+
+		if (copy_from_user(&mr, optval, sizeof(mr)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		err = xdp_umem_create(&umem);
+
+		err = xdp_umem_reg(umem, &mr);
+		if (err) {
+			kfree(umem);
+			mutex_unlock(&xs->mutex);
+			return err;
+		}
+
+		/* Make sure umem is ready before it can be seen by others */
+		smp_wmb();
+
+		xs->umem = umem;
+		mutex_unlock(&xs->mutex);
+		return 0;
+	}
+	default:
+		break;
+	}
+
+	return -ENOPROTOOPT;
+}
+
+static struct proto xsk_proto = {
+	.name =		"XDP",
+	.owner =	THIS_MODULE,
+	.obj_size =	sizeof(struct xdp_sock),
+};
+
+static const struct proto_ops xsk_proto_ops = {
+	.family =	PF_XDP,
+	.owner =	THIS_MODULE,
+	.release =	xsk_release,
+	.bind =		sock_no_bind,
+	.connect =	sock_no_connect,
+	.socketpair =	sock_no_socketpair,
+	.accept =	sock_no_accept,
+	.getname =	sock_no_getname,
+	.poll =		sock_no_poll,
+	.ioctl =	sock_no_ioctl,
+	.listen =	sock_no_listen,
+	.shutdown =	sock_no_shutdown,
+	.setsockopt =	xsk_setsockopt,
+	.getsockopt =	sock_no_getsockopt,
+	.sendmsg =	sock_no_sendmsg,
+	.recvmsg =	sock_no_recvmsg,
+	.mmap =		sock_no_mmap,
+	.sendpage =	sock_no_sendpage,
+};
+
+static void xsk_destruct(struct sock *sk)
+{
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (!sock_flag(sk, SOCK_DEAD))
+		return;
+
+	xdp_put_umem(xs->umem);
+
+	sk_refcnt_debug_dec(sk);
+}
+
+static int xsk_create(struct net *net, struct socket *sock, int protocol,
+		      int kern)
+{
+	struct sock *sk;
+	struct xdp_sock *xs;
+
+	if (!ns_capable(net->user_ns, CAP_NET_RAW))
+		return -EPERM;
+	if (sock->type != SOCK_RAW)
+		return -ESOCKTNOSUPPORT;
+
+	if (protocol)
+		return -EPROTONOSUPPORT;
+
+	sock->state = SS_UNCONNECTED;
+
+	sk = sk_alloc(net, PF_XDP, GFP_KERNEL, &xsk_proto, kern);
+	if (!sk)
+		return -ENOBUFS;
+
+	sock->ops = &xsk_proto_ops;
+
+	sock_init_data(sock, sk);
+
+	sk->sk_family = PF_XDP;
+
+	sk->sk_destruct = xsk_destruct;
+	sk_refcnt_debug_inc(sk);
+
+	xs = xdp_sk(sk);
+	mutex_init(&xs->mutex);
+
+	local_bh_disable();
+	sock_prot_inuse_add(net, &xsk_proto, 1);
+	local_bh_enable();
+
+	return 0;
+}
+
+static const struct net_proto_family xsk_family_ops = {
+	.family = PF_XDP,
+	.create = xsk_create,
+	.owner	= THIS_MODULE,
+};
+
+static int __init xsk_init(void)
+{
+	int err;
+
+	err = proto_register(&xsk_proto, 0 /* no slab */);
+	if (err)
+		goto out;
+
+	err = sock_register(&xsk_family_ops);
+	if (err)
+		goto out_proto;
+
+	return 0;
+
+out_proto:
+	proto_unregister(&xsk_proto);
+out:
+	return err;
+}
+
+fs_initcall(xsk_init);
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 03/15] xsk: add umem fill queue support and mmap
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, we add another setsockopt for registered user memory (umem)
called XDP_UMEM_FILL_QUEUE. Using this socket option, the process can
ask the kernel to allocate a queue (ring buffer) and also mmap it
(XDP_UMEM_PGOFF_FILL_QUEUE) into the process.

The queue is used to explicitly pass ownership of umem frames from the
user process to the kernel. These frames will in a later patch be
filled in with Rx packet data by the kernel.

v2: Fixed potential crash in xsk_mmap.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 15 +++++++++++
 net/xdp/Makefile            |  2 +-
 net/xdp/xdp_umem.c          |  5 ++++
 net/xdp/xdp_umem.h          |  2 ++
 net/xdp/xsk.c               | 65 ++++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.c         | 58 ++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk_queue.h         | 38 ++++++++++++++++++++++++++
 7 files changed, 183 insertions(+), 2 deletions(-)
 create mode 100644 net/xdp/xsk_queue.c
 create mode 100644 net/xdp/xsk_queue.h

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 41252135a0fe..975661e1baca 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -23,6 +23,7 @@
 
 /* XDP socket options */
 #define XDP_UMEM_REG			3
+#define XDP_UMEM_FILL_RING		4
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -31,4 +32,18 @@ struct xdp_umem_reg {
 	__u32 frame_headroom; /* Frame head room */
 };
 
+/* Pgoff for mmaping the rings */
+#define XDP_UMEM_PGOFF_FILL_RING	0x100000000
+
+struct xdp_ring {
+	__u32 producer __attribute__((aligned(64)));
+	__u32 consumer __attribute__((aligned(64)));
+};
+
+/* Used for the fill and completion queues for buffers */
+struct xdp_umem_ring {
+	struct xdp_ring ptrs;
+	__u32 desc[0] __attribute__((aligned(64)));
+};
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/net/xdp/Makefile b/net/xdp/Makefile
index a5d736640a0f..074fb2b2d51c 100644
--- a/net/xdp/Makefile
+++ b/net/xdp/Makefile
@@ -1,2 +1,2 @@
-obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o
+obj-$(CONFIG_XDP_SOCKETS) += xsk.o xdp_umem.o xsk_queue.o
 
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index ec8b3552be44..e1f627d0cc1c 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -65,6 +65,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
 	struct task_struct *task;
 	struct mm_struct *mm;
 
+	if (umem->fq) {
+		xskq_destroy(umem->fq);
+		umem->fq = NULL;
+	}
+
 	if (umem->pgs) {
 		xdp_umem_unpin_pages(umem);
 
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 4597ae81a221..25634b8a5c6f 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -19,9 +19,11 @@
 #include <linux/if_xdp.h>
 #include <linux/workqueue.h>
 
+#include "xsk_queue.h"
 #include "xdp_umem_props.h"
 
 struct xdp_umem {
+	struct xsk_queue *fq;
 	struct page **pgs;
 	struct xdp_umem_props props;
 	u32 npgs;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 84e0e867febb..da67a3c5c1c9 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -32,6 +32,7 @@
 #include <linux/netdevice.h>
 #include <net/xdp_sock.h>
 
+#include "xsk_queue.h"
 #include "xdp_umem.h"
 
 static struct xdp_sock *xdp_sk(struct sock *sk)
@@ -39,6 +40,21 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
+{
+	struct xsk_queue *q;
+
+	if (entries == 0 || *queue || !is_power_of_2(entries))
+		return -EINVAL;
+
+	q = xskq_create(entries);
+	if (!q)
+		return -ENOMEM;
+
+	*queue = q;
+	return 0;
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
@@ -101,6 +117,23 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return 0;
 	}
+	case XDP_UMEM_FILL_RING:
+	{
+		struct xsk_queue **q;
+		int entries;
+
+		if (!xs->umem)
+			return -EINVAL;
+
+		if (copy_from_user(&entries, optval, sizeof(entries)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		q = &xs->umem->fq;
+		err = xsk_init_queue(entries, q);
+		mutex_unlock(&xs->mutex);
+		return err;
+	}
 	default:
 		break;
 	}
@@ -108,6 +141,36 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 	return -ENOPROTOOPT;
 }
 
+static int xsk_mmap(struct file *file, struct socket *sock,
+		    struct vm_area_struct *vma)
+{
+	unsigned long offset = vma->vm_pgoff << PAGE_SHIFT;
+	unsigned long size = vma->vm_end - vma->vm_start;
+	struct xdp_sock *xs = xdp_sk(sock->sk);
+	struct xsk_queue *q = NULL;
+	unsigned long pfn;
+	struct page *qpg;
+
+	if (!xs->umem)
+		return -EINVAL;
+
+	if (offset == XDP_UMEM_PGOFF_FILL_RING)
+		q = xs->umem->fq;
+	else
+		return -EINVAL;
+
+	if (!q)
+		return -EINVAL;
+
+	qpg = virt_to_head_page(q->ring);
+	if (size > (PAGE_SIZE << compound_order(qpg)))
+		return -EINVAL;
+
+	pfn = virt_to_phys(q->ring) >> PAGE_SHIFT;
+	return remap_pfn_range(vma, vma->vm_start, pfn,
+			       size, vma->vm_page_prot);
+}
+
 static struct proto xsk_proto = {
 	.name =		"XDP",
 	.owner =	THIS_MODULE,
@@ -131,7 +194,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.getsockopt =	sock_no_getsockopt,
 	.sendmsg =	sock_no_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
-	.mmap =		sock_no_mmap,
+	.mmap =		xsk_mmap,
 	.sendpage =	sock_no_sendpage,
 };
 
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
new file mode 100644
index 000000000000..23da4f29d3fb
--- /dev/null
+++ b/net/xdp/xsk_queue.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XDP user-space ring structure
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/slab.h>
+
+#include "xsk_queue.h"
+
+static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
+{
+	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
+}
+
+struct xsk_queue *xskq_create(u32 nentries)
+{
+	struct xsk_queue *q;
+	gfp_t gfp_flags;
+	size_t size;
+
+	q = kzalloc(sizeof(*q), GFP_KERNEL);
+	if (!q)
+		return NULL;
+
+	q->nentries = nentries;
+	q->ring_mask = nentries - 1;
+
+	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
+		    __GFP_COMP  | __GFP_NORETRY;
+	size = xskq_umem_get_ring_size(q);
+
+	q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
+						      get_order(size));
+	if (!q->ring) {
+		kfree(q);
+		return NULL;
+	}
+
+	return q;
+}
+
+void xskq_destroy(struct xsk_queue *q)
+{
+	if (!q)
+		return;
+
+	page_frag_free(q->ring);
+	kfree(q);
+}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
new file mode 100644
index 000000000000..7eb556bf73be
--- /dev/null
+++ b/net/xdp/xsk_queue.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0
+ * XDP user-space ring structure
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_XSK_QUEUE_H
+#define _LINUX_XSK_QUEUE_H
+
+#include <linux/types.h>
+#include <linux/if_xdp.h>
+
+#include "xdp_umem_props.h"
+
+struct xsk_queue {
+	struct xdp_umem_props umem_props;
+	u32 ring_mask;
+	u32 nentries;
+	u32 prod_head;
+	u32 prod_tail;
+	u32 cons_head;
+	u32 cons_tail;
+	struct xdp_ring *ring;
+	u64 invalid_descs;
+};
+
+struct xsk_queue *xskq_create(u32 nentries);
+void xskq_destroy(struct xsk_queue *q);
+
+#endif /* _LINUX_XSK_QUEUE_H */
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 04/15] xsk: add Rx queue setup and mmap support
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Another setsockopt (XDP_RX_QUEUE) is added to let the process allocate
a queue, where the kernel can pass completed Rx frames from the kernel
to user process.

The mmapping of the queue is done using the XDP_PGOFF_RX_QUEUE offset.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp_sock.h      |  4 ++++
 include/uapi/linux/if_xdp.h | 16 ++++++++++++++++
 net/xdp/xsk.c               | 41 ++++++++++++++++++++++++++++++++---------
 net/xdp/xsk_queue.c         | 11 +++++++++--
 net/xdp/xsk_queue.h         |  2 +-
 5 files changed, 62 insertions(+), 12 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 94785f5db13e..db9a321de087 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -18,11 +18,15 @@
 #include <linux/mutex.h>
 #include <net/sock.h>
 
+struct net_device;
+struct xsk_queue;
 struct xdp_umem;
 
 struct xdp_sock {
 	/* struct sock must be the first member of struct xdp_sock */
 	struct sock sk;
+	struct xsk_queue *rx;
+	struct net_device *dev;
 	struct xdp_umem *umem;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 975661e1baca..65324558829d 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -22,6 +22,7 @@
 #include <linux/types.h>
 
 /* XDP socket options */
+#define XDP_RX_RING			1
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
 
@@ -33,13 +34,28 @@ struct xdp_umem_reg {
 };
 
 /* Pgoff for mmaping the rings */
+#define XDP_PGOFF_RX_RING			  0
 #define XDP_UMEM_PGOFF_FILL_RING	0x100000000
 
+struct xdp_desc {
+	__u32 idx;
+	__u32 len;
+	__u16 offset;
+	__u8 flags;
+	__u8 padding[5];
+};
+
 struct xdp_ring {
 	__u32 producer __attribute__((aligned(64)));
 	__u32 consumer __attribute__((aligned(64)));
 };
 
+/* Used for the RX and TX queues for packets */
+struct xdp_rxtx_ring {
+	struct xdp_ring ptrs;
+	struct xdp_desc desc[0] __attribute__((aligned(64)));
+};
+
 /* Used for the fill and completion queues for buffers */
 struct xdp_umem_ring {
 	struct xdp_ring ptrs;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index da67a3c5c1c9..92bd9b7e548f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -31,6 +31,7 @@
 #include <linux/net.h>
 #include <linux/netdevice.h>
 #include <net/xdp_sock.h>
+#include <net/xdp.h>
 
 #include "xsk_queue.h"
 #include "xdp_umem.h"
@@ -40,14 +41,15 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
-static int xsk_init_queue(u32 entries, struct xsk_queue **queue)
+static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
+			  bool umem_queue)
 {
 	struct xsk_queue *q;
 
 	if (entries == 0 || *queue || !is_power_of_2(entries))
 		return -EINVAL;
 
-	q = xskq_create(entries);
+	q = xskq_create(entries, umem_queue);
 	if (!q)
 		return -ENOMEM;
 
@@ -89,6 +91,22 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		return -ENOPROTOOPT;
 
 	switch (optname) {
+	case XDP_RX_RING:
+	{
+		struct xsk_queue **q;
+		int entries;
+
+		if (optlen < sizeof(entries))
+			return -EINVAL;
+		if (copy_from_user(&entries, optval, sizeof(entries)))
+			return -EFAULT;
+
+		mutex_lock(&xs->mutex);
+		q = &xs->rx;
+		err = xsk_init_queue(entries, q, false);
+		mutex_unlock(&xs->mutex);
+		return err;
+	}
 	case XDP_UMEM_REG:
 	{
 		struct xdp_umem_reg mr;
@@ -130,7 +148,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 
 		mutex_lock(&xs->mutex);
 		q = &xs->umem->fq;
-		err = xsk_init_queue(entries, q);
+		err = xsk_init_queue(entries, q, true);
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
@@ -151,13 +169,17 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 	unsigned long pfn;
 	struct page *qpg;
 
-	if (!xs->umem)
-		return -EINVAL;
+	if (offset == XDP_PGOFF_RX_RING) {
+		q = xs->rx;
+	} else {
+		if (!xs->umem)
+			return -EINVAL;
 
-	if (offset == XDP_UMEM_PGOFF_FILL_RING)
-		q = xs->umem->fq;
-	else
-		return -EINVAL;
+		if (offset == XDP_UMEM_PGOFF_FILL_RING)
+			q = xs->umem->fq;
+		else
+			return -EINVAL;
+	}
 
 	if (!q)
 		return -EINVAL;
@@ -205,6 +227,7 @@ static void xsk_destruct(struct sock *sk)
 	if (!sock_flag(sk, SOCK_DEAD))
 		return;
 
+	xskq_destroy(xs->rx);
 	xdp_put_umem(xs->umem);
 
 	sk_refcnt_debug_dec(sk);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 23da4f29d3fb..894f9f89afc7 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -21,7 +21,13 @@ static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
 	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
 }
 
-struct xsk_queue *xskq_create(u32 nentries)
+static u32 xskq_rxtx_get_ring_size(struct xsk_queue *q)
+{
+	return (sizeof(struct xdp_ring) +
+		q->nentries * sizeof(struct xdp_desc));
+}
+
+struct xsk_queue *xskq_create(u32 nentries, bool umem_queue)
 {
 	struct xsk_queue *q;
 	gfp_t gfp_flags;
@@ -36,7 +42,8 @@ struct xsk_queue *xskq_create(u32 nentries)
 
 	gfp_flags = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
 		    __GFP_COMP  | __GFP_NORETRY;
-	size = xskq_umem_get_ring_size(q);
+	size = umem_queue ? xskq_umem_get_ring_size(q) :
+	       xskq_rxtx_get_ring_size(q);
 
 	q->ring = (struct xdp_ring *)__get_free_pages(gfp_flags,
 						      get_order(size));
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 7eb556bf73be..5439fa381763 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -32,7 +32,7 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
-struct xsk_queue *xskq_create(u32 nentries);
+struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q);
 
 #endif /* _LINUX_XSK_QUEUE_H */
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 05/15] xsk: add support for bind for Rx
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, the bind syscall is added. Binding an AF_XDP socket, means
associating the socket to an umem, a netdev and a queue index. This
can be done in two ways.

The first way, creating a "socket from scratch". Create the umem using
the XDP_UMEM_REG setsockopt and an associated fill queue with
XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
setsockopt. Call bind passing ifindex and queue index ("channel" in
ethtool speak).

The second way to bind a socket, is simply skipping the
umem/netdev/queue index, and passing another already setup AF_XDP
socket. The new socket will then have the same umem/netdev/queue index
as the parent so it will share the same umem. You must also set the
flags field in the socket address to XDP_SHARED_UMEM.

v2: Use PTR_ERR instead of passing error variable explicitly.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h      |   1 +
 include/uapi/linux/if_xdp.h |  11 ++++
 net/xdp/xdp_umem.c          |   5 ++
 net/xdp/xdp_umem.h          |   1 +
 net/xdp/xsk.c               | 124 +++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.c         |   8 +++
 net/xdp/xsk_queue.h         |   1 +
 7 files changed, 150 insertions(+), 1 deletion(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index db9a321de087..85d02512f59b 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -28,6 +28,7 @@ struct xdp_sock {
 	struct xsk_queue *rx;
 	struct net_device *dev;
 	struct xdp_umem *umem;
+	u16 queue_id;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 };
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 65324558829d..e5091881f776 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -21,6 +21,17 @@
 
 #include <linux/types.h>
 
+/* Options for the sxdp_flags field */
+#define XDP_SHARED_UMEM 1
+
+struct sockaddr_xdp {
+	__u16 sxdp_family;
+	__u32 sxdp_ifindex;
+	__u32 sxdp_queue_id;
+	__u32 sxdp_shared_umem_fd;
+	__u16 sxdp_flags;
+};
+
 /* XDP socket options */
 #define XDP_RX_RING			1
 #define XDP_UMEM_REG			3
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index e1f627d0cc1c..9bac1ad570fa 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -248,3 +248,8 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	put_pid(umem->pid);
 	return err;
 }
+
+bool xdp_umem_validate_queues(struct xdp_umem *umem)
+{
+	return umem->fq;
+}
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index 25634b8a5c6f..b13133e9c501 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -39,6 +39,7 @@ struct xdp_umem {
 	struct work_struct work;
 };
 
+bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
 void xdp_get_umem(struct xdp_umem *umem);
 void xdp_put_umem(struct xdp_umem *umem);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 92bd9b7e548f..bf2c97b87992 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -57,9 +57,18 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 	return 0;
 }
 
+static void __xsk_release(struct xdp_sock *xs)
+{
+	/* Wait for driver to stop using the xdp socket. */
+	synchronize_net();
+
+	dev_put(xs->dev);
+}
+
 static int xsk_release(struct socket *sock)
 {
 	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
 	struct net *net;
 
 	if (!sk)
@@ -71,6 +80,11 @@ static int xsk_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	local_bh_enable();
 
+	if (xs->dev) {
+		__xsk_release(xs);
+		xs->dev = NULL;
+	}
+
 	sock_orphan(sk);
 	sock->sk = NULL;
 
@@ -80,6 +94,114 @@ static int xsk_release(struct socket *sock)
 	return 0;
 }
 
+static struct socket *xsk_lookup_xsk_from_fd(int fd)
+{
+	struct socket *sock;
+	int err;
+
+	sock = sockfd_lookup(fd, &err);
+	if (!sock)
+		return ERR_PTR(-ENOTSOCK);
+
+	if (sock->sk->sk_family != PF_XDP) {
+		sockfd_put(sock);
+		return ERR_PTR(-ENOPROTOOPT);
+	}
+
+	return sock;
+}
+
+static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
+{
+	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
+	struct sock *sk = sock->sk;
+	struct net_device *dev, *dev_curr;
+	struct xdp_sock *xs = xdp_sk(sk);
+	struct xdp_umem *old_umem = NULL;
+	int err = 0;
+
+	if (addr_len < sizeof(struct sockaddr_xdp))
+		return -EINVAL;
+	if (sxdp->sxdp_family != AF_XDP)
+		return -EINVAL;
+
+	mutex_lock(&xs->mutex);
+	dev_curr = xs->dev;
+	dev = dev_get_by_index(sock_net(sk), sxdp->sxdp_ifindex);
+	if (!dev) {
+		err = -ENODEV;
+		goto out_release;
+	}
+
+	if (!xs->rx) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (sxdp->sxdp_queue_id >= dev->num_rx_queues) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	if (sxdp->sxdp_flags & XDP_SHARED_UMEM) {
+		struct xdp_sock *umem_xs;
+		struct socket *sock;
+
+		if (xs->umem) {
+			/* We have already our own. */
+			err = -EINVAL;
+			goto out_unlock;
+		}
+
+		sock = xsk_lookup_xsk_from_fd(sxdp->sxdp_shared_umem_fd);
+		if (IS_ERR(sock)) {
+			err = PTR_ERR(sock);
+			goto out_unlock;
+		}
+
+		umem_xs = xdp_sk(sock->sk);
+		if (!umem_xs->umem) {
+			/* No umem to inherit. */
+			err = -EBADF;
+			sockfd_put(sock);
+			goto out_unlock;
+		} else if (umem_xs->dev != dev ||
+			   umem_xs->queue_id != sxdp->sxdp_queue_id) {
+			err = -EINVAL;
+			sockfd_put(sock);
+			goto out_unlock;
+		}
+
+		xdp_get_umem(umem_xs->umem);
+		old_umem = xs->umem;
+		xs->umem = umem_xs->umem;
+		sockfd_put(sock);
+	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
+		err = -EINVAL;
+		goto out_unlock;
+	}
+
+	/* Rebind? */
+	if (dev_curr && (dev_curr != dev ||
+			 xs->queue_id != sxdp->sxdp_queue_id)) {
+		__xsk_release(xs);
+		if (old_umem)
+			xdp_put_umem(old_umem);
+	}
+
+	xs->dev = dev;
+	xs->queue_id = sxdp->sxdp_queue_id;
+
+	xskq_set_umem(xs->rx, &xs->umem->props);
+
+out_unlock:
+	if (err)
+		dev_put(dev);
+out_release:
+	mutex_unlock(&xs->mutex);
+	return err;
+}
+
 static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			  char __user *optval, unsigned int optlen)
 {
@@ -203,7 +325,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.family =	PF_XDP,
 	.owner =	THIS_MODULE,
 	.release =	xsk_release,
-	.bind =		sock_no_bind,
+	.bind =		xsk_bind,
 	.connect =	sock_no_connect,
 	.socketpair =	sock_no_socketpair,
 	.accept =	sock_no_accept,
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 894f9f89afc7..d012e5e23591 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -16,6 +16,14 @@
 
 #include "xsk_queue.h"
 
+void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props)
+{
+	if (!q)
+		return;
+
+	q->umem_props = *umem_props;
+}
+
 static u32 xskq_umem_get_ring_size(struct xsk_queue *q)
 {
 	return sizeof(struct xdp_umem_ring) + q->nentries * sizeof(u32);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 5439fa381763..9ddd2ee07a84 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -32,6 +32,7 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
+void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props);
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q);
 
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 06/15] xsk: add Rx receive functions and poll support
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Here the actual receive functions of AF_XDP are implemented, that in a
later commit, will be called from the XDP layers.

There's one set of functions for the XDP_DRV side and another for
XDP_SKB (generic).

A new XDP API, xdp_return_buff, is also introduced.

Adding xdp_return_buff, which is analogous to xdp_return_frame, but
acts upon an struct xdp_buff. The API will be used by AF_XDP in future
commits.

Support for the poll syscall is also implemented.

v2: xskq_validate_id did not update cons_tail.
    The entries variable was calculated twice in xskq_nb_avail.
    Squashed xdp_return_buff commit.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/net/xdp.h      |   1 +
 include/net/xdp_sock.h |  22 ++++++++++
 net/core/xdp.c         |  15 +++++--
 net/xdp/xdp_umem.h     |  18 ++++++++
 net/xdp/xsk.c          |  73 ++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.h    | 114 ++++++++++++++++++++++++++++++++++++++++++++++++-
 6 files changed, 238 insertions(+), 5 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 137ad5f9f40f..0b689cf561c7 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -104,6 +104,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 }
 
 void xdp_return_frame(struct xdp_frame *xdpf);
+void xdp_return_buff(struct xdp_buff *xdp);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 85d02512f59b..ea54483f64bb 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -28,9 +28,31 @@ struct xdp_sock {
 	struct xsk_queue *rx;
 	struct net_device *dev;
 	struct xdp_umem *umem;
+	u64 rx_dropped;
 	u16 queue_id;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 };
 
+struct xdp_buff;
+#ifdef CONFIG_XDP_SOCKETS
+int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
+void xsk_flush(struct xdp_sock *xs);
+#else
+static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	return -ENOTSUPP;
+}
+
+static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	return -ENOTSUPP;
+}
+
+static inline void xsk_flush(struct xdp_sock *xs)
+{
+}
+#endif /* CONFIG_XDP_SOCKETS */
+
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 0c86b53a3a63..bf6758f74339 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -308,11 +308,9 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-void xdp_return_frame(struct xdp_frame *xdpf)
+static void xdp_return(void *data, struct xdp_mem_info *mem)
 {
-	struct xdp_mem_info *mem = &xdpf->mem;
 	struct xdp_mem_allocator *xa;
-	void *data = xdpf->data;
 	struct page *page;
 
 	switch (mem->type) {
@@ -339,4 +337,15 @@ void xdp_return_frame(struct xdp_frame *xdpf)
 		break;
 	}
 }
+
+void xdp_return_frame(struct xdp_frame *xdpf)
+{
+	xdp_return(xdpf->data, &xdpf->mem);
+}
 EXPORT_SYMBOL_GPL(xdp_return_frame);
+
+void xdp_return_buff(struct xdp_buff *xdp)
+{
+	xdp_return(xdp->data, &xdp->rxq->mem);
+}
+EXPORT_SYMBOL_GPL(xdp_return_buff);
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index b13133e9c501..c7378a11721f 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -39,6 +39,24 @@ struct xdp_umem {
 	struct work_struct work;
 };
 
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u32 idx)
+{
+	u64 pg, off;
+	char *data;
+
+	pg = idx >> umem->nfpplog2;
+	off = (idx & umem->nfpp_mask) << umem->frame_size_log2;
+
+	data = page_address(umem->pgs[pg]);
+	return data + off;
+}
+
+static inline char *xdp_umem_get_data_with_headroom(struct xdp_umem *umem,
+						    u32 idx)
+{
+	return xdp_umem_get_data(umem, idx) + umem->frame_headroom;
+}
+
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
 int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr);
 void xdp_get_umem(struct xdp_umem *umem);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index bf2c97b87992..4e1e6c581e1d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -41,6 +41,74 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	u32 *id, len = xdp->data_end - xdp->data;
+	void *buffer;
+	int err = 0;
+
+	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+		return -EINVAL;
+
+	id = xskq_peek_id(xs->umem->fq);
+	if (!id)
+		return -ENOSPC;
+
+	buffer = xdp_umem_get_data_with_headroom(xs->umem, *id);
+	memcpy(buffer, xdp->data, len);
+	err = xskq_produce_batch_desc(xs->rx, *id, len,
+				      xs->umem->frame_headroom);
+	if (!err)
+		xskq_discard_id(xs->umem->fq);
+
+	return err;
+}
+
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	int err;
+
+	err = __xsk_rcv(xs, xdp);
+	if (likely(!err))
+		xdp_return_buff(xdp);
+	else
+		xs->rx_dropped++;
+
+	return err;
+}
+
+void xsk_flush(struct xdp_sock *xs)
+{
+	xskq_produce_flush_desc(xs->rx);
+	xs->sk.sk_data_ready(&xs->sk);
+}
+
+int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	int err;
+
+	err = __xsk_rcv(xs, xdp);
+	if (!err)
+		xsk_flush(xs);
+	else
+		xs->rx_dropped++;
+
+	return err;
+}
+
+static unsigned int xsk_poll(struct file *file, struct socket *sock,
+			     struct poll_table_struct *wait)
+{
+	unsigned int mask = datagram_poll(file, sock, wait);
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (xs->rx && !xskq_empty_desc(xs->rx))
+		mask |= POLLIN | POLLRDNORM;
+
+	return mask;
+}
+
 static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 			  bool umem_queue)
 {
@@ -179,6 +247,9 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	} else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) {
 		err = -EINVAL;
 		goto out_unlock;
+	} else {
+		/* This xsk has its own umem. */
+		xskq_set_umem(xs->umem->fq, &xs->umem->props);
 	}
 
 	/* Rebind? */
@@ -330,7 +401,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.socketpair =	sock_no_socketpair,
 	.accept =	sock_no_accept,
 	.getname =	sock_no_getname,
-	.poll =		sock_no_poll,
+	.poll =		xsk_poll,
 	.ioctl =	sock_no_ioctl,
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 9ddd2ee07a84..0a9b92b4f93a 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -20,6 +20,8 @@
 
 #include "xdp_umem_props.h"
 
+#define RX_BATCH_SIZE 16
+
 struct xsk_queue {
 	struct xdp_umem_props umem_props;
 	u32 ring_mask;
@@ -32,8 +34,118 @@ struct xsk_queue {
 	u64 invalid_descs;
 };
 
+/* Common functions operating for both RXTX and umem queues */
+
+static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
+{
+	u32 entries = q->prod_tail - q->cons_tail;
+
+	if (entries == 0) {
+		/* Refresh the local pointer */
+		q->prod_tail = READ_ONCE(q->ring->producer);
+		entries = q->prod_tail - q->cons_tail;
+	}
+
+	return (entries > dcnt) ? dcnt : entries;
+}
+
+static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
+{
+	u32 free_entries = q->nentries - (producer - q->cons_tail);
+
+	if (free_entries >= dcnt)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cons_tail = READ_ONCE(q->ring->consumer);
+	return q->nentries - (producer - q->cons_tail);
+}
+
+/* UMEM queue */
+
+static inline bool xskq_is_valid_id(struct xsk_queue *q, u32 idx)
+{
+	if (unlikely(idx >= q->umem_props.nframes)) {
+		q->invalid_descs++;
+		return false;
+	}
+	return true;
+}
+
+static inline u32 *xskq_validate_id(struct xsk_queue *q)
+{
+	while (q->cons_tail != q->cons_head) {
+		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+		unsigned int idx = q->cons_tail & q->ring_mask;
+
+		if (xskq_is_valid_id(q, ring->desc[idx]))
+			return &ring->desc[idx];
+
+		q->cons_tail++;
+	}
+
+	return NULL;
+}
+
+static inline u32 *xskq_peek_id(struct xsk_queue *q)
+{
+	struct xdp_umem_ring *ring;
+
+	if (q->cons_tail == q->cons_head) {
+		WRITE_ONCE(q->ring->consumer, q->cons_tail);
+		q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
+
+		/* Order consumer and data */
+		smp_rmb();
+
+		return xskq_validate_id(q);
+	}
+
+	ring = (struct xdp_umem_ring *)q->ring;
+	return &ring->desc[q->cons_tail & q->ring_mask];
+}
+
+static inline void xskq_discard_id(struct xsk_queue *q)
+{
+	q->cons_tail++;
+	(void)xskq_validate_id(q);
+}
+
+/* Rx queue */
+
+static inline int xskq_produce_batch_desc(struct xsk_queue *q,
+					  u32 id, u32 len, u16 offset)
+{
+	struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
+	unsigned int idx;
+
+	if (xskq_nb_free(q, q->prod_head, 1) == 0)
+		return -ENOSPC;
+
+	idx = (q->prod_head++) & q->ring_mask;
+	ring->desc[idx].idx = id;
+	ring->desc[idx].len = len;
+	ring->desc[idx].offset = offset;
+
+	return 0;
+}
+
+static inline void xskq_produce_flush_desc(struct xsk_queue *q)
+{
+	/* Order producer and data */
+	smp_wmb();
+
+	q->prod_tail = q->prod_head,
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+}
+
+static inline bool xskq_empty_desc(struct xsk_queue *q)
+{
+	return (xskq_nb_free(q, q->prod_tail, 1) == q->nentries);
+}
+
 void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props);
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
-void xskq_destroy(struct xsk_queue *q);
+void xskq_destroy(struct xsk_queue *q_ops);
 
 #endif /* _LINUX_XSK_QUEUE_H */
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

The xskmap is yet another BPF map, very much inspired by
dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
adds AF_XDP sockets into the map, and by using the bpf_redirect_map
helper, an XDP program can redirect XDP frames to an AF_XDP socket.

Note that a socket that is bound to certain ifindex/queue index will
*only* accept XDP frames from that netdev/queue index. If an XDP
program tries to redirect from a netdev/queue index other than what
the socket is bound to, the frame will not be received on the socket.

A socket can reside in multiple maps.

v2: Removed one indirection in map lookup.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/bpf.h       |  26 +++++
 include/linux/bpf_types.h |   3 +
 include/net/xdp_sock.h    |   7 ++
 include/uapi/linux/bpf.h  |   1 +
 kernel/bpf/Makefile       |   3 +
 kernel/bpf/verifier.c     |   8 +-
 kernel/bpf/xskmap.c       | 272 ++++++++++++++++++++++++++++++++++++++++++++++
 net/xdp/xsk.c             |   5 +
 8 files changed, 323 insertions(+), 2 deletions(-)
 create mode 100644 kernel/bpf/xskmap.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 38ebbc61ed99..6b62509777a2 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -676,6 +676,32 @@ static inline int sock_map_prog(struct bpf_map *map,
 }
 #endif
 
+#if defined(CONFIG_XDP_SOCKETS)
+struct xdp_sock;
+struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key);
+int __xsk_map_redirect(struct bpf_map *map, u32 index,
+		       struct xdp_buff *xdp, struct xdp_sock *xs);
+void __xsk_map_flush(struct bpf_map *map);
+#else
+struct xdp_sock;
+static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
+						     u32 key)
+{
+	return NULL;
+}
+
+static inline int __xsk_map_redirect(struct bpf_map *map, u32 index,
+				     struct xdp_buff *xdp,
+				     struct xdp_sock *xs)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void __xsk_map_flush(struct bpf_map *map)
+{
+}
+#endif
+
 /* verifier prototypes for helper functions called from eBPF programs */
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b28fcf6f6ae..d7df1b323082 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -49,4 +49,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
+#if defined(CONFIG_XDP_SOCKETS)
+BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
+#endif
 #endif
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index ea54483f64bb..678a1d293579 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -28,6 +28,7 @@ struct xdp_sock {
 	struct xsk_queue *rx;
 	struct net_device *dev;
 	struct xdp_umem *umem;
+	struct rcu_head rcu;
 	u64 rx_dropped;
 	u16 queue_id;
 	/* Protects multiple processes in the control path */
@@ -39,6 +40,7 @@ struct xdp_buff;
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
+bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
@@ -53,6 +55,11 @@ static inline int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 static inline void xsk_flush(struct xdp_sock *xs)
 {
 }
+
+static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
+{
+	return false;
+}
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index da77a9388947..7c16bc197ff5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -116,6 +116,7 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_DEVMAP,
 	BPF_MAP_TYPE_SOCKMAP,
 	BPF_MAP_TYPE_CPUMAP,
+	BPF_MAP_TYPE_XSKMAP,
 };
 
 enum bpf_prog_type {
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 35c485fa9ea3..f27f5496d6fe 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -8,6 +8,9 @@ obj-$(CONFIG_BPF_SYSCALL) += btf.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
 obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
+ifeq ($(CONFIG_XDP_SOCKETS),y)
+obj-$(CONFIG_BPF_SYSCALL) += xskmap.o
+endif
 obj-$(CONFIG_BPF_SYSCALL) += offload.o
 ifeq ($(CONFIG_STREAM_PARSER),y)
 ifeq ($(CONFIG_INET),y)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index eb1a596aebd3..0ac49d62dc0c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2061,8 +2061,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (func_id != BPF_FUNC_redirect_map)
 			goto error;
 		break;
-	/* Restrict bpf side of cpumap, open when use-cases appear */
+	/* Restrict bpf side of cpumap and xskmap, open when use-cases
+	 * appear.
+	 */
 	case BPF_MAP_TYPE_CPUMAP:
+	case BPF_MAP_TYPE_XSKMAP:
 		if (func_id != BPF_FUNC_redirect_map)
 			goto error;
 		break;
@@ -2109,7 +2112,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_FUNC_redirect_map:
 		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
-		    map->map_type != BPF_MAP_TYPE_CPUMAP)
+		    map->map_type != BPF_MAP_TYPE_CPUMAP &&
+		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
 	case BPF_FUNC_sk_redirect_map:
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
new file mode 100644
index 000000000000..eb60824af070
--- /dev/null
+++ b/kernel/bpf/xskmap.c
@@ -0,0 +1,272 @@
+// SPDX-License-Identifier: GPL-2.0
+/* XSKMAP used for AF_XDP sockets
+ * Copyright(c) 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/bpf.h>
+#include <linux/capability.h>
+#include <net/xdp_sock.h>
+#include <linux/slab.h>
+#include <linux/sched.h>
+
+struct xsk_map {
+	struct bpf_map map;
+	struct xdp_sock **xsk_map;
+	unsigned long __percpu *flush_needed;
+};
+
+static u64 xsk_map_bitmap_size(const union bpf_attr *attr)
+{
+	return BITS_TO_LONGS((u64) attr->max_entries) * sizeof(unsigned long);
+}
+
+static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
+{
+	struct xsk_map *m;
+	int err = -EINVAL;
+	u64 cost;
+
+	if (!capable(CAP_NET_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if (attr->max_entries == 0 || attr->key_size != 4 ||
+	    attr->value_size != 4 ||
+	    attr->map_flags & ~(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY))
+		return ERR_PTR(-EINVAL);
+
+	m = kzalloc(sizeof(*m), GFP_USER);
+	if (!m)
+		return ERR_PTR(-ENOMEM);
+
+	bpf_map_init_from_attr(&m->map, attr);
+
+	cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);
+	cost += xsk_map_bitmap_size(attr) * num_possible_cpus();
+	if (cost >= U32_MAX - PAGE_SIZE)
+		goto free_m;
+
+	m->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
+
+	/* Notice returns -EPERM on if map size is larger than memlock limit */
+	err = bpf_map_precharge_memlock(m->map.pages);
+	if (err)
+		goto free_m;
+
+	m->flush_needed = __alloc_percpu(xsk_map_bitmap_size(attr),
+					    __alignof__(unsigned long));
+	if (!m->flush_needed)
+		goto free_m;
+
+	m->xsk_map = bpf_map_area_alloc(m->map.max_entries *
+					sizeof(struct xdp_sock *),
+					m->map.numa_node);
+	if (!m->xsk_map)
+		goto free_percpu;
+	return &m->map;
+
+free_percpu:
+	free_percpu(m->flush_needed);
+free_m:
+	kfree(m);
+	return ERR_PTR(err);
+}
+
+static void xsk_map_free(struct bpf_map *map)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	int i, cpu;
+
+	/* At this point bpf_prog->aux->refcnt == 0 and this
+	 * map->refcnt == 0, so the programs (can be more than one
+	 * that used this map) were disconnected from events. Wait for
+	 * outstanding critical sections in these programs to
+	 * complete. The rcu critical section only guarantees no
+	 * further reads against xsk_map. It does __not__ ensure
+	 * pending flush operations (if any) are complete.
+	 */
+
+	synchronize_rcu();
+
+	/* To ensure all pending flush operations have completed wait
+	 * for flush bitmap to indicate all flush_needed bits to be
+	 * zero on _all_ cpus.  Because the above synchronize_rcu()
+	 * ensures the map is disconnected from the program we can
+	 * assume no new bits will be set.
+	 */
+	for_each_online_cpu(cpu) {
+		unsigned long *bitmap = per_cpu_ptr(m->flush_needed, cpu);
+
+		while (!bitmap_empty(bitmap, map->max_entries))
+			cond_resched();
+	}
+
+	for (i = 0; i < map->max_entries; i++) {
+		struct xdp_sock *xs;
+
+		xs = m->xsk_map[i];
+		if (!xs)
+			continue;
+
+		sock_put((struct sock *)xs);
+	}
+
+	free_percpu(m->flush_needed);
+	bpf_map_area_free(m->xsk_map);
+	kfree(m);
+}
+
+static int xsk_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	u32 index = key ? *(u32 *)key : U32_MAX;
+	u32 *next = next_key;
+
+	if (index >= m->map.max_entries) {
+		*next = 0;
+		return 0;
+	}
+
+	if (index == m->map.max_entries - 1)
+		return -ENOENT;
+	*next = index + 1;
+	return 0;
+}
+
+struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xdp_sock *xs;
+
+	if (key >= map->max_entries)
+		return NULL;
+
+	xs = READ_ONCE(m->xsk_map[key]);
+	return xs;
+}
+
+int __xsk_map_redirect(struct bpf_map *map, u32 index,
+		       struct xdp_buff *xdp, struct xdp_sock *xs)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	unsigned long *bitmap = this_cpu_ptr(m->flush_needed);
+	int err;
+
+	err = xsk_rcv(xs, xdp);
+	if (err)
+		return err;
+
+	__set_bit(index, bitmap);
+	return 0;
+}
+
+void __xsk_map_flush(struct bpf_map *map)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	unsigned long *bitmap = this_cpu_ptr(m->flush_needed);
+	u32 bit;
+
+	for_each_set_bit(bit, bitmap, map->max_entries) {
+		struct xdp_sock *xs = READ_ONCE(m->xsk_map[bit]);
+
+		/* This is possible if the entry is removed by user
+		 * space between xdp redirect and flush op.
+		 */
+		if (unlikely(!xs))
+			continue;
+
+		__clear_bit(bit, bitmap);
+		xsk_flush(xs);
+	}
+}
+
+static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	return NULL;
+}
+
+static void __xsk_map_entry_free(struct rcu_head *rcu)
+{
+	struct xdp_sock *xs;
+
+	xs = container_of(rcu, struct xdp_sock, rcu);
+	xsk_flush(xs);
+	sock_put((struct sock *)xs);
+}
+
+static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
+			       u64 map_flags)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	u32 i = *(u32 *)key, fd = *(u32 *)value;
+	struct xdp_sock *xs, *old_xs;
+	struct socket *sock;
+	int err;
+
+	if (unlikely(map_flags > BPF_EXIST))
+		return -EINVAL;
+	if (unlikely(i >= m->map.max_entries))
+		return -E2BIG;
+	if (unlikely(map_flags == BPF_NOEXIST))
+		return -EEXIST;
+
+	sock = sockfd_lookup(fd, &err);
+	if (!sock)
+		return err;
+
+	if (sock->sk->sk_family != PF_XDP) {
+		sockfd_put(sock);
+		return -EOPNOTSUPP;
+	}
+
+	xs = (struct xdp_sock *)sock->sk;
+
+	if (!xsk_is_setup_for_bpf_map(xs)) {
+		sockfd_put(sock);
+		return -EOPNOTSUPP;
+	}
+
+	sock_hold(sock->sk);
+
+	old_xs = xchg(&m->xsk_map[i], xs);
+	if (old_xs)
+		call_rcu(&old_xs->rcu, __xsk_map_entry_free);
+
+	sockfd_put(sock);
+	return 0;
+}
+
+static int xsk_map_delete_elem(struct bpf_map *map, void *key)
+{
+	struct xsk_map *m = container_of(map, struct xsk_map, map);
+	struct xdp_sock *old_xs;
+	int k = *(u32 *)key;
+
+	if (k >= map->max_entries)
+		return -EINVAL;
+
+	old_xs = xchg(&m->xsk_map[k], NULL);
+	if (old_xs)
+		call_rcu(&old_xs->rcu, __xsk_map_entry_free);
+
+	return 0;
+}
+
+const struct bpf_map_ops xsk_map_ops = {
+	.map_alloc = xsk_map_alloc,
+	.map_free = xsk_map_free,
+	.map_get_next_key = xsk_map_get_next_key,
+	.map_lookup_elem = xsk_map_lookup_elem,
+	.map_update_elem = xsk_map_update_elem,
+	.map_delete_elem = xsk_map_delete_elem,
+};
+
+
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4e1e6c581e1d..b931a0db5588 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -41,6 +41,11 @@ static struct xdp_sock *xdp_sk(struct sock *sk)
 	return (struct xdp_sock *)sk;
 }
 
+bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
+{
+	return !!xs->rx;
+}
+
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	u32 *id, len = xdp->data_end - xdp->data;
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 08/15] xsk: wire up XDP_DRV side of AF_XDP
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

This commit wires up the xskmap to XDP_DRV layer.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/core/filter.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index d3781daa26ab..1f2726318725 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2801,7 +2801,8 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 {
 	int err;
 
-	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP: {
 		struct net_device *dev = fwd;
 		struct xdp_frame *xdpf;
 
@@ -2819,14 +2820,25 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 		if (err)
 			return err;
 		__dev_map_insert_ctx(map, index);
-
-	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP) {
+		break;
+	}
+	case BPF_MAP_TYPE_CPUMAP: {
 		struct bpf_cpu_map_entry *rcpu = fwd;
 
 		err = cpu_map_enqueue(rcpu, xdp, dev_rx);
 		if (err)
 			return err;
 		__cpu_map_insert_ctx(map, index);
+		break;
+	}
+	case BPF_MAP_TYPE_XSKMAP: {
+		struct xdp_sock *xs = fwd;
+
+		err = __xsk_map_redirect(map, index, xdp, xs);
+		return err;
+	}
+	default:
+		break;
 	}
 	return 0;
 }
@@ -2845,6 +2857,9 @@ void xdp_do_flush_map(void)
 		case BPF_MAP_TYPE_CPUMAP:
 			__cpu_map_flush(map);
 			break;
+		case BPF_MAP_TYPE_XSKMAP:
+			__xsk_map_flush(map);
+			break;
 		default:
 			break;
 		}
@@ -2859,6 +2874,8 @@ static void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
 		return __dev_map_lookup_elem(map, index);
 	case BPF_MAP_TYPE_CPUMAP:
 		return __cpu_map_lookup_elem(map, index);
+	case BPF_MAP_TYPE_XSKMAP:
+		return __xsk_map_lookup_elem(map, index);
 	default:
 		return NULL;
 	}
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 09/15] xsk: wire up XDP_SKB side of AF_XDP
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: Björn Töpel, michael.lundkvist, jesse.brandeburg,
	anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

This commit wires up the xskmap to XDP_SKB layer.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/filter.h |  2 +-
 net/core/dev.c         | 35 +++++++++++++++++++----------------
 net/core/filter.c      | 17 ++++++++++++++---
 3 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b2308174..6ab9a6765b00 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -759,7 +759,7 @@ struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
  * This does not appear to be a real limitation for existing software.
  */
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
-			    struct bpf_prog *prog);
+			    struct xdp_buff *xdp, struct bpf_prog *prog);
 int xdp_do_redirect(struct net_device *dev,
 		    struct xdp_buff *xdp,
 		    struct bpf_prog *prog);
diff --git a/net/core/dev.c b/net/core/dev.c
index 8f8931b93140..aea36b5a2fed 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3994,12 +3994,12 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
 }
 
 static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct xdp_buff *xdp,
 				     struct bpf_prog *xdp_prog)
 {
 	struct netdev_rx_queue *rxqueue;
 	void *orig_data, *orig_data_end;
 	u32 metalen, act = XDP_DROP;
-	struct xdp_buff xdp;
 	int hlen, off;
 	u32 mac_len;
 
@@ -4034,19 +4034,19 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	 */
 	mac_len = skb->data - skb_mac_header(skb);
 	hlen = skb_headlen(skb) + mac_len;
-	xdp.data = skb->data - mac_len;
-	xdp.data_meta = xdp.data;
-	xdp.data_end = xdp.data + hlen;
-	xdp.data_hard_start = skb->data - skb_headroom(skb);
-	orig_data_end = xdp.data_end;
-	orig_data = xdp.data;
+	xdp->data = skb->data - mac_len;
+	xdp->data_meta = xdp->data;
+	xdp->data_end = xdp->data + hlen;
+	xdp->data_hard_start = skb->data - skb_headroom(skb);
+	orig_data_end = xdp->data_end;
+	orig_data = xdp->data;
 
 	rxqueue = netif_get_rxqueue(skb);
-	xdp.rxq = &rxqueue->xdp_rxq;
+	xdp->rxq = &rxqueue->xdp_rxq;
 
-	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
-	off = xdp.data - orig_data;
+	off = xdp->data - orig_data;
 	if (off > 0)
 		__skb_pull(skb, off);
 	else if (off < 0)
@@ -4056,10 +4056,11 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	/* check if bpf_xdp_adjust_tail was used. it can only "shrink"
 	 * pckt.
 	 */
-	off = orig_data_end - xdp.data_end;
+	off = orig_data_end - xdp->data_end;
 	if (off != 0) {
-		skb_set_tail_pointer(skb, xdp.data_end - xdp.data);
+		skb_set_tail_pointer(skb, xdp->data_end - xdp->data);
 		skb->len -= off;
+
 	}
 
 	switch (act) {
@@ -4068,7 +4069,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 		__skb_push(skb, mac_len);
 		break;
 	case XDP_PASS:
-		metalen = xdp.data - xdp.data_meta;
+		metalen = xdp->data - xdp->data_meta;
 		if (metalen)
 			skb_metadata_set(skb, metalen);
 		break;
@@ -4118,17 +4119,19 @@ static struct static_key generic_xdp_needed __read_mostly;
 int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
 {
 	if (xdp_prog) {
-		u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+		struct xdp_buff xdp;
+		u32 act;
 		int err;
 
+		act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
 		if (act != XDP_PASS) {
 			switch (act) {
 			case XDP_REDIRECT:
 				err = xdp_do_generic_redirect(skb->dev, skb,
-							      xdp_prog);
+							      &xdp, xdp_prog);
 				if (err)
 					goto out_redir;
-			/* fallthru to submit skb */
+				break;
 			case XDP_TX:
 				generic_xdp_tx(skb, xdp_prog);
 				break;
diff --git a/net/core/filter.c b/net/core/filter.c
index 1f2726318725..e20956ebf664 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -59,6 +59,7 @@
 #include <net/tcp.h>
 #include <net/xfrm.h>
 #include <linux/bpf_trace.h>
+#include <net/xdp_sock.h>
 
 /**
  *	sk_filter_trim_cap - run a packet through a socket filter
@@ -2973,13 +2974,14 @@ static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
 
 static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
+				       struct xdp_buff *xdp,
 				       struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	unsigned long map_owner = ri->map_owner;
 	struct bpf_map *map = ri->map;
-	struct net_device *fwd = NULL;
 	u32 index = ri->ifindex;
+	void *fwd = NULL;
 	int err = 0;
 
 	ri->ifindex = 0;
@@ -3001,6 +3003,14 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 		if (unlikely((err = __xdp_generic_ok_fwd_dev(skb, fwd))))
 			goto err;
 		skb->dev = fwd;
+		generic_xdp_tx(skb, xdp_prog);
+	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
+		struct xdp_sock *xs = fwd;
+
+		err = xsk_generic_rcv(xs, xdp);
+		if (err)
+			goto err;
+		consume_skb(skb);
 	} else {
 		/* TODO: Handle BPF_MAP_TYPE_CPUMAP */
 		err = -EBADRQC;
@@ -3015,7 +3025,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 }
 
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
-			    struct bpf_prog *xdp_prog)
+			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
 	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
 	u32 index = ri->ifindex;
@@ -3023,7 +3033,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 	int err = 0;
 
 	if (ri->map)
-		return xdp_do_generic_redirect_map(dev, skb, xdp_prog);
+		return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog);
 
 	ri->ifindex = 0;
 	fwd = dev_get_by_index_rcu(dev_net(dev), index);
@@ -3037,6 +3047,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 	skb->dev = fwd;
 	_trace_xdp_redirect(dev, xdp_prog, index);
+	generic_xdp_tx(skb, xdp_prog);
 	return 0;
 err:
 	_trace_xdp_redirect_err(dev, xdp_prog, index, err);
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 10/15] xsk: add umem completion queue support and mmap
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, we add another setsockopt for registered user memory (umem)
called XDP_UMEM_COMPLETION_QUEUE. Using this socket option, the
process can ask the kernel to allocate a queue (ring buffer) and also
mmap it (XDP_UMEM_PGOFF_COMPLETION_QUEUE) into the process.

The queue is used to explicitly pass ownership of umem frames from the
kernel to user process. This will be used by the TX path to tell user
space that a certain frame has been transmitted and user space can use
it for something else, if it wishes.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h | 2 ++
 net/xdp/xdp_umem.c          | 7 ++++++-
 net/xdp/xdp_umem.h          | 1 +
 net/xdp/xsk.c               | 7 ++++++-
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index e5091881f776..71581a139f26 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -36,6 +36,7 @@ struct sockaddr_xdp {
 #define XDP_RX_RING			1
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
+#define XDP_UMEM_COMPLETION_RING	5
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -47,6 +48,7 @@ struct xdp_umem_reg {
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
 #define XDP_UMEM_PGOFF_FILL_RING	0x100000000
+#define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000
 
 struct xdp_desc {
 	__u32 idx;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 9bac1ad570fa..881dfdefe235 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -70,6 +70,11 @@ static void xdp_umem_release(struct xdp_umem *umem)
 		umem->fq = NULL;
 	}
 
+	if (umem->cq) {
+		xskq_destroy(umem->cq);
+		umem->cq = NULL;
+	}
+
 	if (umem->pgs) {
 		xdp_umem_unpin_pages(umem);
 
@@ -251,5 +256,5 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 
 bool xdp_umem_validate_queues(struct xdp_umem *umem)
 {
-	return umem->fq;
+	return (umem->fq && umem->cq);
 }
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index c7378a11721f..7e0b2fab8522 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -24,6 +24,7 @@
 
 struct xdp_umem {
 	struct xsk_queue *fq;
+	struct xsk_queue *cq;
 	struct page **pgs;
 	struct xdp_umem_props props;
 	u32 npgs;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b931a0db5588..f4a2c5bc6da9 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -255,6 +255,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	} else {
 		/* This xsk has its own umem. */
 		xskq_set_umem(xs->umem->fq, &xs->umem->props);
+		xskq_set_umem(xs->umem->cq, &xs->umem->props);
 	}
 
 	/* Rebind? */
@@ -334,6 +335,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		return 0;
 	}
 	case XDP_UMEM_FILL_RING:
+	case XDP_UMEM_COMPLETION_RING:
 	{
 		struct xsk_queue **q;
 		int entries;
@@ -345,7 +347,8 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			return -EFAULT;
 
 		mutex_lock(&xs->mutex);
-		q = &xs->umem->fq;
+		q = (optname == XDP_UMEM_FILL_RING) ? &xs->umem->fq :
+			&xs->umem->cq;
 		err = xsk_init_queue(entries, q, true);
 		mutex_unlock(&xs->mutex);
 		return err;
@@ -375,6 +378,8 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 
 		if (offset == XDP_UMEM_PGOFF_FILL_RING)
 			q = xs->umem->fq;
+		else if (offset == XDP_UMEM_PGOFF_COMPLETION_RING)
+			q = xs->umem->cq;
 		else
 			return -EINVAL;
 	}
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 11/15] xsk: add Tx queue setup and mmap support
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Another setsockopt (XDP_TX_QUEUE) is added to let the process allocate
a queue, where the user process can pass frames to be transmitted by
the kernel.

The mmapping of the queue is done using the XDP_PGOFF_TX_QUEUE offset.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/net/xdp_sock.h      | 1 +
 include/uapi/linux/if_xdp.h | 2 ++
 net/xdp/xsk.c               | 8 ++++++--
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 678a1d293579..759e04b991c5 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -31,6 +31,7 @@ struct xdp_sock {
 	struct rcu_head rcu;
 	u64 rx_dropped;
 	u16 queue_id;
+	struct xsk_queue *tx ____cacheline_aligned_in_smp;
 	/* Protects multiple processes in the control path */
 	struct mutex mutex;
 };
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 71581a139f26..e2ea878d025c 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -34,6 +34,7 @@ struct sockaddr_xdp {
 
 /* XDP socket options */
 #define XDP_RX_RING			1
+#define XDP_TX_RING			2
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
 #define XDP_UMEM_COMPLETION_RING	5
@@ -47,6 +48,7 @@ struct xdp_umem_reg {
 
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
+#define XDP_PGOFF_TX_RING		 0x80000000
 #define XDP_UMEM_PGOFF_FILL_RING	0x100000000
 #define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index f4a2c5bc6da9..2d7b0c90d996 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -206,7 +206,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		goto out_release;
 	}
 
-	if (!xs->rx) {
+	if (!xs->rx && !xs->tx) {
 		err = -EINVAL;
 		goto out_unlock;
 	}
@@ -291,6 +291,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 
 	switch (optname) {
 	case XDP_RX_RING:
+	case XDP_TX_RING:
 	{
 		struct xsk_queue **q;
 		int entries;
@@ -301,7 +302,7 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 			return -EFAULT;
 
 		mutex_lock(&xs->mutex);
-		q = &xs->rx;
+		q = (optname == XDP_TX_RING) ? &xs->tx : &xs->rx;
 		err = xsk_init_queue(entries, q, false);
 		mutex_unlock(&xs->mutex);
 		return err;
@@ -372,6 +373,8 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 
 	if (offset == XDP_PGOFF_RX_RING) {
 		q = xs->rx;
+	} else if (offset == XDP_PGOFF_TX_RING) {
+		q = xs->tx;
 	} else {
 		if (!xs->umem)
 			return -EINVAL;
@@ -431,6 +434,7 @@ static void xsk_destruct(struct sock *sk)
 		return;
 
 	xskq_destroy(xs->rx);
+	xskq_destroy(xs->tx);
 	xdp_put_umem(xs->umem);
 
 	sk_refcnt_debug_dec(sk);
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 12/15] dev: packet: make packet_direct_xmit a common function
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

The new dev_direct_xmit will be used by AF_XDP in later commits.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c            | 38 ++++++++++++++++++++++++++++++++++++++
 net/packet/af_packet.c    | 42 +++++-------------------------------------
 3 files changed, 44 insertions(+), 37 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 366c32891158..a30435118530 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2486,6 +2486,7 @@ void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *newskb);
 int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
+int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
 int register_netdevice(struct net_device *dev);
 void unregister_netdevice_queue(struct net_device *dev, struct list_head *head);
 void unregister_netdevice_many(struct list_head *head);
diff --git a/net/core/dev.c b/net/core/dev.c
index aea36b5a2fed..d3fdc86516e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3625,6 +3625,44 @@ int dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv)
 }
 EXPORT_SYMBOL(dev_queue_xmit_accel);
 
+int dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
+{
+	struct net_device *dev = skb->dev;
+	struct sk_buff *orig_skb = skb;
+	struct netdev_queue *txq;
+	int ret = NETDEV_TX_BUSY;
+	bool again = false;
+
+	if (unlikely(!netif_running(dev) ||
+		     !netif_carrier_ok(dev)))
+		goto drop;
+
+	skb = validate_xmit_skb_list(skb, dev, &again);
+	if (skb != orig_skb)
+		goto drop;
+
+	skb_set_queue_mapping(skb, queue_id);
+	txq = skb_get_tx_queue(dev, skb);
+
+	local_bh_disable();
+
+	HARD_TX_LOCK(dev, txq, smp_processor_id());
+	if (!netif_xmit_frozen_or_drv_stopped(txq))
+		ret = netdev_start_xmit(skb, dev, txq, false);
+	HARD_TX_UNLOCK(dev, txq);
+
+	local_bh_enable();
+
+	if (!dev_xmit_complete(ret))
+		kfree_skb(skb);
+
+	return ret;
+drop:
+	atomic_long_inc(&dev->tx_dropped);
+	kfree_skb_list(skb);
+	return NET_XMIT_DROP;
+}
+EXPORT_SYMBOL(dev_direct_xmit);
 
 /*************************************************************************
  *			Receiver routines
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 01f3515cada0..611a26d5235c 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -209,7 +209,7 @@ static void prb_clear_rxhash(struct tpacket_kbdq_core *,
 static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
 		struct tpacket3_hdr *);
 static void packet_flush_mclist(struct sock *sk);
-static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb);
+static u16 packet_pick_tx_queue(struct sk_buff *skb);
 
 struct packet_skb_cb {
 	union {
@@ -243,40 +243,7 @@ static void __fanout_link(struct sock *sk, struct packet_sock *po);
 
 static int packet_direct_xmit(struct sk_buff *skb)
 {
-	struct net_device *dev = skb->dev;
-	struct sk_buff *orig_skb = skb;
-	struct netdev_queue *txq;
-	int ret = NETDEV_TX_BUSY;
-	bool again = false;
-
-	if (unlikely(!netif_running(dev) ||
-		     !netif_carrier_ok(dev)))
-		goto drop;
-
-	skb = validate_xmit_skb_list(skb, dev, &again);
-	if (skb != orig_skb)
-		goto drop;
-
-	packet_pick_tx_queue(dev, skb);
-	txq = skb_get_tx_queue(dev, skb);
-
-	local_bh_disable();
-
-	HARD_TX_LOCK(dev, txq, smp_processor_id());
-	if (!netif_xmit_frozen_or_drv_stopped(txq))
-		ret = netdev_start_xmit(skb, dev, txq, false);
-	HARD_TX_UNLOCK(dev, txq);
-
-	local_bh_enable();
-
-	if (!dev_xmit_complete(ret))
-		kfree_skb(skb);
-
-	return ret;
-drop:
-	atomic_long_inc(&dev->tx_dropped);
-	kfree_skb_list(skb);
-	return NET_XMIT_DROP;
+	return dev_direct_xmit(skb, packet_pick_tx_queue(skb));
 }
 
 static struct net_device *packet_cached_dev_get(struct packet_sock *po)
@@ -313,8 +280,9 @@ static u16 __packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
 	return (u16) raw_smp_processor_id() % dev->real_num_tx_queues;
 }
 
-static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
+static u16 packet_pick_tx_queue(struct sk_buff *skb)
 {
+	struct net_device *dev = skb->dev;
 	const struct net_device_ops *ops = dev->netdev_ops;
 	u16 queue_index;
 
@@ -326,7 +294,7 @@ static void packet_pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
 		queue_index = __packet_pick_tx_queue(dev, skb);
 	}
 
-	skb_set_queue_mapping(skb, queue_index);
+	return queue_index;
 }
 
 /* __register_prot_hook must be invoked through register_prot_hook
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 13/15] xsk: support for Tx
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

Here, Tx support is added. The user fills the Tx queue with frames to
be sent by the kernel, and let's the kernel know using the sendmsg
syscall.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 net/xdp/xsk.c       | 111 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 net/xdp/xsk_queue.h |  93 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 200 insertions(+), 4 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 2d7b0c90d996..b33c535c7996 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -36,6 +36,8 @@
 #include "xsk_queue.h"
 #include "xdp_umem.h"
 
+#define TX_BATCH_SIZE 16
+
 static struct xdp_sock *xdp_sk(struct sock *sk)
 {
 	return (struct xdp_sock *)sk;
@@ -101,6 +103,108 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return err;
 }
 
+static void xsk_destruct_skb(struct sk_buff *skb)
+{
+	u32 id = (u32)(long)skb_shinfo(skb)->destructor_arg;
+	struct xdp_sock *xs = xdp_sk(skb->sk);
+
+	WARN_ON_ONCE(xskq_produce_id(xs->umem->cq, id));
+
+	sock_wfree(skb);
+}
+
+static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
+			    size_t total_len)
+{
+	bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
+	u32 max_batch = TX_BATCH_SIZE;
+	struct xdp_sock *xs = xdp_sk(sk);
+	bool sent_frame = false;
+	struct xdp_desc desc;
+	struct sk_buff *skb;
+	int err = 0;
+
+	if (unlikely(!xs->tx))
+		return -ENOBUFS;
+	if (need_wait)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&xs->mutex);
+
+	while (xskq_peek_desc(xs->tx, &desc)) {
+		char *buffer;
+		u32 id, len;
+
+		if (max_batch-- == 0) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		if (xskq_reserve_id(xs->umem->cq)) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		len = desc.len;
+		if (unlikely(len > xs->dev->mtu)) {
+			err = -EMSGSIZE;
+			goto out;
+		}
+
+		skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
+		if (unlikely(!skb)) {
+			err = -EAGAIN;
+			goto out;
+		}
+
+		skb_put(skb, len);
+		id = desc.idx;
+		buffer = xdp_umem_get_data(xs->umem, id) + desc.offset;
+		err = skb_store_bits(skb, 0, buffer, len);
+		if (unlikely(err)) {
+			kfree_skb(skb);
+			goto out;
+		}
+
+		skb->dev = xs->dev;
+		skb->priority = sk->sk_priority;
+		skb->mark = sk->sk_mark;
+		skb_shinfo(skb)->destructor_arg = (void *)(long)id;
+		skb->destructor = xsk_destruct_skb;
+
+		err = dev_direct_xmit(skb, xs->queue_id);
+		/* Ignore NET_XMIT_CN as packet might have been sent */
+		if (err == NET_XMIT_DROP || err == NETDEV_TX_BUSY) {
+			err = -EAGAIN;
+			/* SKB consumed by dev_direct_xmit() */
+			goto out;
+		}
+
+		sent_frame = true;
+		xskq_discard_desc(xs->tx);
+	}
+
+out:
+	if (sent_frame)
+		sk->sk_write_space(sk);
+
+	mutex_unlock(&xs->mutex);
+	return err;
+}
+
+static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+
+	if (unlikely(!xs->dev))
+		return -ENXIO;
+	if (unlikely(!(xs->dev->flags & IFF_UP)))
+		return -ENETDOWN;
+
+	return xsk_generic_xmit(sk, m, total_len);
+}
+
 static unsigned int xsk_poll(struct file *file, struct socket *sock,
 			     struct poll_table_struct *wait)
 {
@@ -110,6 +214,8 @@ static unsigned int xsk_poll(struct file *file, struct socket *sock,
 
 	if (xs->rx && !xskq_empty_desc(xs->rx))
 		mask |= POLLIN | POLLRDNORM;
+	if (xs->tx && !xskq_full_desc(xs->tx))
+		mask |= POLLOUT | POLLWRNORM;
 
 	return mask;
 }
@@ -270,6 +376,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 	xs->queue_id = sxdp->sxdp_queue_id;
 
 	xskq_set_umem(xs->rx, &xs->umem->props);
+	xskq_set_umem(xs->tx, &xs->umem->props);
 
 out_unlock:
 	if (err)
@@ -383,8 +490,6 @@ static int xsk_mmap(struct file *file, struct socket *sock,
 			q = xs->umem->fq;
 		else if (offset == XDP_UMEM_PGOFF_COMPLETION_RING)
 			q = xs->umem->cq;
-		else
-			return -EINVAL;
 	}
 
 	if (!q)
@@ -420,7 +525,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	xsk_setsockopt,
 	.getsockopt =	sock_no_getsockopt,
-	.sendmsg =	sock_no_sendmsg,
+	.sendmsg =	xsk_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
 	.mmap =		xsk_mmap,
 	.sendpage =	sock_no_sendpage,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 0a9b92b4f93a..3497e8808608 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -111,7 +111,93 @@ static inline void xskq_discard_id(struct xsk_queue *q)
 	(void)xskq_validate_id(q);
 }
 
-/* Rx queue */
+static inline int xskq_produce_id(struct xsk_queue *q, u32 id)
+{
+	struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
+
+	ring->desc[q->prod_tail++ & q->ring_mask] = id;
+
+	/* Order producer and data */
+	smp_wmb();
+
+	WRITE_ONCE(q->ring->producer, q->prod_tail);
+	return 0;
+}
+
+static inline int xskq_reserve_id(struct xsk_queue *q)
+{
+	if (xskq_nb_free(q, q->prod_head, 1) == 0)
+		return -ENOSPC;
+
+	q->prod_head++;
+	return 0;
+}
+
+/* Rx/Tx queue */
+
+static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
+{
+	u32 buff_len;
+
+	if (unlikely(d->idx >= q->umem_props.nframes)) {
+		q->invalid_descs++;
+		return false;
+	}
+
+	buff_len = q->umem_props.frame_size;
+	if (unlikely(d->len > buff_len || d->len == 0 ||
+		     d->offset > buff_len || d->offset + d->len > buff_len)) {
+		q->invalid_descs++;
+		return false;
+	}
+
+	return true;
+}
+
+static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
+						  struct xdp_desc *desc)
+{
+	while (q->cons_tail != q->cons_head) {
+		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
+		unsigned int idx = q->cons_tail & q->ring_mask;
+
+		if (xskq_is_valid_desc(q, &ring->desc[idx])) {
+			if (desc)
+				*desc = ring->desc[idx];
+			return desc;
+		}
+
+		q->cons_tail++;
+	}
+
+	return NULL;
+}
+
+static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
+					      struct xdp_desc *desc)
+{
+	struct xdp_rxtx_ring *ring;
+
+	if (q->cons_tail == q->cons_head) {
+		WRITE_ONCE(q->ring->consumer, q->cons_tail);
+		q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
+
+		/* Order consumer and data */
+		smp_rmb();
+
+		return xskq_validate_desc(q, desc);
+	}
+
+	ring = (struct xdp_rxtx_ring *)q->ring;
+	*desc = ring->desc[q->cons_tail & q->ring_mask];
+	return desc;
+}
+
+static inline void xskq_discard_desc(struct xsk_queue *q)
+{
+	q->cons_tail++;
+	(void)xskq_validate_desc(q, NULL);
+}
 
 static inline int xskq_produce_batch_desc(struct xsk_queue *q,
 					  u32 id, u32 len, u16 offset)
@@ -139,6 +225,11 @@ static inline void xskq_produce_flush_desc(struct xsk_queue *q)
 	WRITE_ONCE(q->ring->producer, q->prod_tail);
 }
 
+static inline bool xskq_full_desc(struct xsk_queue *q)
+{
+	return (xskq_nb_avail(q, q->nentries) == q->nentries);
+}
+
 static inline bool xskq_empty_desc(struct xsk_queue *q)
 {
 	return (xskq_nb_free(q, q->prod_tail, 1) == q->nentries);
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 14/15] xsk: statistics support
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

In this commit, a new getsockopt is added: XDP_STATISTICS. This is
used to obtain stats from the sockets.

v2: getsockopt now returns size of stats structure.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 include/uapi/linux/if_xdp.h |  7 +++++++
 net/xdp/xsk.c               | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 net/xdp/xsk_queue.h         |  5 +++++
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index e2ea878d025c..77b88c4efe98 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -38,6 +38,7 @@ struct sockaddr_xdp {
 #define XDP_UMEM_REG			3
 #define XDP_UMEM_FILL_RING		4
 #define XDP_UMEM_COMPLETION_RING	5
+#define XDP_STATISTICS			6
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -46,6 +47,12 @@ struct xdp_umem_reg {
 	__u32 frame_headroom; /* Frame head room */
 };
 
+struct xdp_statistics {
+	__u64 rx_dropped; /* Dropped for reasons other than invalid desc */
+	__u64 rx_invalid_descs; /* Dropped due to invalid descriptor */
+	__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
+};
+
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
 #define XDP_PGOFF_TX_RING		 0x80000000
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b33c535c7996..009c5af5bba5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -468,6 +468,49 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 	return -ENOPROTOOPT;
 }
 
+static int xsk_getsockopt(struct socket *sock, int level, int optname,
+			  char __user *optval, int __user *optlen)
+{
+	struct sock *sk = sock->sk;
+	struct xdp_sock *xs = xdp_sk(sk);
+	int len;
+
+	if (level != SOL_XDP)
+		return -ENOPROTOOPT;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	if (len < 0)
+		return -EINVAL;
+
+	switch (optname) {
+	case XDP_STATISTICS:
+	{
+		struct xdp_statistics stats;
+
+		if (len < sizeof(stats))
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		stats.rx_dropped = xs->rx_dropped;
+		stats.rx_invalid_descs = xskq_nb_invalid_descs(xs->rx);
+		stats.tx_invalid_descs = xskq_nb_invalid_descs(xs->tx);
+		mutex_unlock(&xs->mutex);
+
+		if (copy_to_user(optval, &stats, sizeof(stats)))
+			return -EFAULT;
+		if (put_user(sizeof(stats), optlen))
+			return -EFAULT;
+
+		return 0;
+	}
+	default:
+		break;
+	}
+
+	return -EOPNOTSUPP;
+}
+
 static int xsk_mmap(struct file *file, struct socket *sock,
 		    struct vm_area_struct *vma)
 {
@@ -524,7 +567,7 @@ static const struct proto_ops xsk_proto_ops = {
 	.listen =	sock_no_listen,
 	.shutdown =	sock_no_shutdown,
 	.setsockopt =	xsk_setsockopt,
-	.getsockopt =	sock_no_getsockopt,
+	.getsockopt =	xsk_getsockopt,
 	.sendmsg =	xsk_sendmsg,
 	.recvmsg =	sock_no_recvmsg,
 	.mmap =		xsk_mmap,
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 3497e8808608..7aa9a535db0e 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -36,6 +36,11 @@ struct xsk_queue {
 
 /* Common functions operating for both RXTX and umem queues */
 
+static inline u64 xskq_nb_invalid_descs(struct xsk_queue *q)
+{
+	return q ? q->invalid_descs : 0;
+}
+
 static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
 {
 	u32 entries = q->prod_tail - q->cons_tail;
-- 
2.14.1

^ permalink raw reply related

* [PATCH bpf-next v2 15/15] samples/bpf: sample application and documentation for AF_XDP sockets
From: Björn Töpel @ 2018-04-27 12:17 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, alexander.h.duyck, alexander.duyck,
	john.fastabend, ast, brouer, willemdebruijn.kernel, daniel, mst,
	netdev
  Cc: michael.lundkvist, jesse.brandeburg, anjali.singhai, qi.z.zhang,
	Björn Töpel
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

From: Magnus Karlsson <magnus.karlsson@intel.com>

This is a sample application for AF_XDP sockets. The application
supports three different modes of operation: rxdrop, txonly and l2fwd.

To show-case a simple round-robin load-balancing between a set of
sockets in an xskmap, set the RR_LB compile time define option to 1 in
"xdpsock.h".

v2: The entries variable was calculated twice in {umem,xq}_nb_avail.

Co-authored-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
---
 Documentation/networking/af_xdp.rst | 297 +++++++++++
 Documentation/networking/index.rst  |   1 +
 samples/bpf/Makefile                |   4 +
 samples/bpf/xdpsock.h               |  11 +
 samples/bpf/xdpsock_kern.c          |  56 +++
 samples/bpf/xdpsock_user.c          | 948 ++++++++++++++++++++++++++++++++++++
 6 files changed, 1317 insertions(+)
 create mode 100644 Documentation/networking/af_xdp.rst
 create mode 100644 samples/bpf/xdpsock.h
 create mode 100644 samples/bpf/xdpsock_kern.c
 create mode 100644 samples/bpf/xdpsock_user.c

diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
new file mode 100644
index 000000000000..91928d9ee4bf
--- /dev/null
+++ b/Documentation/networking/af_xdp.rst
@@ -0,0 +1,297 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+AF_XDP
+======
+
+Overview
+========
+
+AF_XDP is an address family that is optimized for high performance
+packet processing.
+
+This document assumes that the reader is familiar with BPF and XDP. If
+not, the Cilium project has an excellent reference guide at
+http://cilium.readthedocs.io/en/doc-1.0/bpf/.
+
+Using the XDP_REDIRECT action from an XDP program, the program can
+redirect ingress frames to other XDP enabled netdevs, using the
+bpf_redirect_map() function. AF_XDP sockets enable the possibility for
+XDP programs to redirect frames to a memory buffer in a user-space
+application.
+
+An AF_XDP socket (XSK) is created with the normal socket()
+syscall. Associated with each XSK are two rings: the RX ring and the
+TX ring. A socket can receive packets on the RX ring and it can send
+packets on the TX ring. These rings are registered and sized with the
+setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory
+to have at least one of these rings for each socket. An RX or TX
+descriptor ring points to a data buffer in a memory area called a
+UMEM. RX and TX can share the same UMEM so that a packet does not have
+to be copied between RX and TX. Moreover, if a packet needs to be kept
+for a while due to a possible retransmit, the descriptor that points
+to that packet can be changed to point to another and reused right
+away. This again avoids copying data.
+
+The UMEM consists of a number of equally size frames and each frame
+has a unique frame id. A descriptor in one of the rings references a
+frame by referencing its frame id. The user space allocates memory for
+this UMEM using whatever means it feels is most appropriate (malloc,
+mmap, huge pages, etc). This memory area is then registered with the
+kernel using the new setsockopt XDP_UMEM_REG. The UMEM also has two
+rings: the FILL ring and the COMPLETION ring. The fill ring is used by
+the application to send down frame ids for the kernel to fill in with
+RX packet data. References to these frames will then appear in the RX
+ring once each packet has been received. The completion ring, on the
+other hand, contains frame ids that the kernel has transmitted
+completely and can now be used again by user space, for either TX or
+RX. Thus, the frame ids appearing in the completion ring are ids that
+were previously transmitted using the TX ring. In summary, the RX and
+FILL rings are used for the RX path and the TX and COMPLETION rings
+are used for the TX path.
+
+The socket is then finally bound with a bind() call to a device and a
+specific queue id on that device, and it is not until bind is
+completed that traffic starts to flow.
+
+The UMEM can be shared between processes, if desired. If a process
+wants to do this, it simply skips the registration of the UMEM and its
+corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind
+call and submits the XSK of the process it would like to share UMEM
+with as well as its own newly created XSK socket. The new process will
+then receive frame id references in its own RX ring that point to this
+shared UMEM. Note that since the ring structures are single-consumer /
+single-producer (for performance reasons), the new process has to
+create its own socket with associated RX and TX rings, since it cannot
+share this with the other process. This is also the reason that there
+is only one set of FILL and COMPLETION rings per UMEM. It is the
+responsibility of a single process to handle the UMEM.
+
+How is then packets distributed from an XDP program to the XSKs? There
+is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The
+user-space application can place an XSK at an arbitrary place in this
+map. The XDP program can then redirect a packet to a specific index in
+this map and at this point XDP validates that the XSK in that map was
+indeed bound to that device and ring number. If not, the packet is
+dropped. If the map is empty at that index, the packet is also
+dropped. This also means that it is currently mandatory to have an XDP
+program loaded (and one XSK in the XSKMAP) to be able to get any
+traffic to user space through the XSK.
+
+AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
+driver does not have support for XDP, or XDP_SKB is explicitly chosen
+when loading the XDP program, XDP_SKB mode is employed that uses SKBs
+together with the generic XDP support and copies out the data to user
+space. A fallback mode that works for any network device. On the other
+hand, if the driver has support for XDP, it will be used by the AF_XDP
+code to provide better performance, but there is still a copy of the
+data into user space.
+
+Concepts
+========
+
+In order to use an AF_XDP socket, a number of associated objects need
+to be setup.
+
+Jonathan Corbet has also written an excellent article on LWN,
+"Accelerating networking with AF_XDP". It can be found at
+https://lwn.net/Articles/750845/.
+
+UMEM
+----
+
+UMEM is a region of virtual contiguous memory, divided into
+equal-sized frames. An UMEM is associated to a netdev and a specific
+queue id of that netdev. It is created and configured (frame size,
+frame headroom, start address and size) by using the XDP_UMEM_REG
+setsockopt system call. A UMEM is bound to a netdev and queue id, via
+the bind() system call.
+
+An AF_XDP is socket linked to a single UMEM, but one UMEM can have
+multiple AF_XDP sockets. To share an UMEM created via one socket A,
+the next socket B can do this by setting the XDP_SHARED_UMEM flag in
+struct sockaddr_xdp member sxdp_flags, and passing the file descriptor
+of A to struct sockaddr_xdp member sxdp_shared_umem_fd.
+
+The UMEM has two single-producer/single-consumer rings, that are used
+to transfer ownership of UMEM frames between the kernel and the
+user-space application.
+
+Rings
+-----
+
+There are a four different kind of rings: Fill, Completion, RX and
+TX. All rings are single-producer/single-consumer, so the user-space
+application need explicit synchronization of multiple
+processes/threads are reading/writing to them.
+
+The UMEM uses two rings: Fill and Completion. Each socket associated
+with the UMEM must have an RX queue, TX queue or both. Say, that there
+is a setup with four sockets (all doing TX and RX). Then there will be
+one Fill ring, one Completion ring, four TX rings and four RX rings.
+
+The rings are head(producer)/tail(consumer) based rings. A producer
+writes the data ring at the index pointed out by struct xdp_ring
+producer member, and increasing the producer index. A consumer reads
+the data ring at the index pointed out by struct xdp_ring consumer
+member, and increasing the consumer index.
+
+The rings are configured and created via the _RING setsockopt system
+calls and mmapped to user-space using the appropriate offset to mmap()
+(XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and
+XDP_UMEM_PGOFF_COMPLETION_RING).
+
+The size of the rings need to be of size power of two.
+
+UMEM Fill Ring
+~~~~~~~~~~~~~~
+
+The Fill ring is used to transfer ownership of UMEM frames from
+user-space to kernel-space. The UMEM indicies are passed in the
+ring. As an example, if the UMEM is 64k and each frame is 4k, then the
+UMEM has 16 frames and can pass indicies between 0 and 15.
+
+Frames passed to the kernel are used for the ingress path (RX rings).
+
+The user application produces UMEM indicies to this ring.
+
+UMEM Completetion Ring
+~~~~~~~~~~~~~~~~~~~~~~
+
+The Completion Ring is used transfer ownership of UMEM frames from
+kernel-space to user-space. Just like the Fill ring, UMEM indicies are
+used.
+
+Frames passed from the kernel to user-space are frames that has been
+sent (TX ring) and can be used by user-space again.
+
+The user application consumes UMEM indicies from this ring.
+
+
+RX Ring
+~~~~~~~
+
+The RX ring is the receiving side of a socket. Each entry in the ring
+is a struct xdp_desc descriptor. The descriptor contains UMEM index
+(idx), the length of the data (len), the offset into the frame
+(offset).
+
+If no frames have been passed to kernel via the Fill ring, no
+descriptors will (or can) appear on the RX ring.
+
+The user application consumes struct xdp_desc descriptors from this
+ring.
+
+TX Ring
+~~~~~~~
+
+The TX ring is used to send frames. The struct xdp_desc descriptor is
+filled (index, length and offset) and passed into the ring.
+
+To start the transfer a sendmsg() system call is required. This might
+be relaxed in the future.
+
+The user application produces struct xdp_desc descriptors to this
+ring.
+
+XSKMAP / BPF_MAP_TYPE_XSKMAP
+----------------------------
+
+On XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that
+is used in conjunction with bpf_redirect_map() to pass the ingress
+frame to a socket.
+
+The user application inserts the socket into the map, via the bpf()
+system call.
+
+Note that if an XDP program tries to redirect to a socket that does
+not match the queue configuration and netdev, the frame will be
+dropped. E.g. an AF_XDP socket is bound to netdev eth0 and
+queue 17. Only the XDP program executing for eth0 and queue 17 will
+successfully pass data to the socket. Please refer to the sample
+application (samples/bpf/) in for an example.
+
+Usage
+=====
+
+In order to use AF_XDP sockets there are two parts needed. The
+user-space application and the XDP program. For a complete setup and
+usage example, please refer to the sample application. The user-space
+side is xdpsock_user.c and the XDP side xdpsock_kern.c.
+
+Naive ring dequeue and enqueue could look like this::
+
+    // typedef struct xdp_rxtx_ring RING;
+    // typedef struct xdp_umem_ring RING;
+
+    // typedef struct xdp_desc RING_TYPE;
+    // typedef __u32 RING_TYPE;
+
+    int dequeue_one(RING *ring, RING_TYPE *item)
+    {
+        __u32 entries = ring->ptrs.producer - ring->ptrs.consumer;
+
+        if (entries == 0)
+            return -1;
+
+        // read-barrier!
+
+        *item = ring->desc[ring->ptrs.consumer & (RING_SIZE - 1)];
+        ring->ptrs.consumer++;
+        return 0;
+    }
+
+    int enqueue_one(RING *ring, const RING_TYPE *item)
+    {
+        u32 free_entries = RING_SIZE - (ring->ptrs.producer - ring->ptrs.consumer);
+
+        if (free_entries == 0)
+            return -1;
+
+        ring->desc[ring->ptrs.producer & (RING_SIZE - 1)] = *item;
+
+        // write-barrier!
+
+        ring->ptrs.producer++;
+        return 0;
+    }
+
+
+For a more optimized version, please refer to the sample application.
+
+Sample application
+==================
+
+There is a xdpsock benchmarking/test application included that
+demonstrates how to use AF_XDP sockets with both private and shared
+UMEMs. Say that you would like your UDP traffic from port 4242 to end
+up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
+for this::
+
+      ethtool -N p3p2 rx-flow-hash udp4 fn
+      ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
+          action 16
+
+Running the rxdrop benchmark in XDP_DRV mode can then be done
+using::
+
+      samples/bpf/xdpsock -i p3p2 -q 16 -r -N
+
+For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
+can be displayed with "-h", as usual.
+
+Credits
+=======
+
+- Björn Töpel (AF_XDP core)
+- Magnus Karlsson (AF_XDP core)
+- Alexander Duyck
+- Alexei Starovoitov
+- Daniel Borkmann
+- Jesper Dangaard Brouer
+- John Fastabend
+- Jonathan Corbet (LWN coverage)
+- Michael S. Tsirkin
+- Qi Z Zhang
+- Willem de Bruijn
+
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index f204eaff657d..cbd9bdd4a79e 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -6,6 +6,7 @@ Contents:
 .. toctree::
    :maxdepth: 2
 
+   af_xdp
    batman-adv
    can
    dpaa2/index
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index b853581592fd..c03f8358f12c 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -45,6 +45,7 @@ hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 hostprogs-y += cpustat
 hostprogs-y += xdp_adjust_tail
+hostprogs-y += xdpsock
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -97,6 +98,7 @@ xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
 xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
+xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -150,6 +152,7 @@ always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
 always += cpustat_kern.o
 always += xdp_adjust_tail_kern.o
+always += xdpsock_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -196,6 +199,7 @@ HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 HOSTLOADLIBES_cpustat += -lelf
 HOSTLOADLIBES_xdp_adjust_tail += -lelf
+HOSTLOADLIBES_xdpsock += -lelf -pthread
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/xdpsock.h b/samples/bpf/xdpsock.h
new file mode 100644
index 000000000000..533ab81adfa1
--- /dev/null
+++ b/samples/bpf/xdpsock.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef XDPSOCK_H_
+#define XDPSOCK_H_
+
+/* Power-of-2 number of sockets */
+#define MAX_SOCKS 4
+
+/* Round-robin receive */
+#define RR_LB 0
+
+#endif /* XDPSOCK_H_ */
diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
new file mode 100644
index 000000000000..d8806c41362e
--- /dev/null
+++ b/samples/bpf/xdpsock_kern.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include "bpf_helpers.h"
+
+#include "xdpsock.h"
+
+struct bpf_map_def SEC("maps") qidconf_map = {
+	.type		= BPF_MAP_TYPE_ARRAY,
+	.key_size	= sizeof(int),
+	.value_size	= sizeof(int),
+	.max_entries	= 1,
+};
+
+struct bpf_map_def SEC("maps") xsks_map = {
+	.type = BPF_MAP_TYPE_XSKMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = 4,
+};
+
+struct bpf_map_def SEC("maps") rr_map = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(int),
+	.value_size = sizeof(unsigned int),
+	.max_entries = 1,
+};
+
+SEC("xdp_sock")
+int xdp_sock_prog(struct xdp_md *ctx)
+{
+	int *qidconf, key = 0, idx;
+	unsigned int *rr;
+
+	qidconf = bpf_map_lookup_elem(&qidconf_map, &key);
+	if (!qidconf)
+		return XDP_ABORTED;
+
+	if (*qidconf != ctx->rx_queue_index)
+		return XDP_PASS;
+
+#if RR_LB /* NB! RR_LB is configured in xdpsock.h */
+	rr = bpf_map_lookup_elem(&rr_map, &key);
+	if (!rr)
+		return XDP_ABORTED;
+
+	*rr = (*rr + 1) & (MAX_SOCKS - 1);
+	idx = *rr;
+#else
+	idx = 0;
+#endif
+
+	return bpf_redirect_map(&xsks_map, idx, 0);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
new file mode 100644
index 000000000000..4b8a7cf3e63b
--- /dev/null
+++ b/samples/bpf/xdpsock_user.c
@@ -0,0 +1,948 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2017 - 2018 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <assert.h>
+#include <errno.h>
+#include <getopt.h>
+#include <libgen.h>
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <linux/if_xdp.h>
+#include <linux/if_ether.h>
+#include <net/if.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <net/ethernet.h>
+#include <sys/resource.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <locale.h>
+#include <sys/types.h>
+#include <poll.h>
+
+#include "bpf_load.h"
+#include "bpf_util.h"
+#include "libbpf.h"
+
+#include "xdpsock.h"
+
+#ifndef SOL_XDP
+#define SOL_XDP 283
+#endif
+
+#ifndef AF_XDP
+#define AF_XDP 44
+#endif
+
+#ifndef PF_XDP
+#define PF_XDP AF_XDP
+#endif
+
+#define NUM_FRAMES 131072
+#define FRAME_HEADROOM 0
+#define FRAME_SIZE 2048
+#define NUM_DESCS 1024
+#define BATCH_SIZE 16
+
+#define FQ_NUM_DESCS 1024
+#define CQ_NUM_DESCS 1024
+
+#define DEBUG_HEXDUMP 0
+
+typedef __u32 u32;
+
+static unsigned long prev_time;
+
+enum benchmark_type {
+	BENCH_RXDROP = 0,
+	BENCH_TXONLY = 1,
+	BENCH_L2FWD = 2,
+};
+
+static enum benchmark_type opt_bench = BENCH_RXDROP;
+static u32 opt_xdp_flags;
+static const char *opt_if = "";
+static int opt_ifindex;
+static int opt_queue;
+static int opt_poll;
+static int opt_shared_packet_buffer;
+static int opt_interval = 1;
+
+struct xdp_umem_uqueue {
+	u32 cached_prod;
+	u32 cached_cons;
+	u32 mask;
+	u32 size;
+	struct xdp_umem_ring *ring;
+};
+
+struct xdp_umem {
+	char (*frames)[FRAME_SIZE];
+	struct xdp_umem_uqueue fq;
+	struct xdp_umem_uqueue cq;
+	int fd;
+};
+
+struct xdp_uqueue {
+	u32 cached_prod;
+	u32 cached_cons;
+	u32 mask;
+	u32 size;
+	struct xdp_rxtx_ring *ring;
+};
+
+struct xdpsock {
+	struct xdp_uqueue rx;
+	struct xdp_uqueue tx;
+	int sfd;
+	struct xdp_umem *umem;
+	u32 outstanding_tx;
+	unsigned long rx_npkts;
+	unsigned long tx_npkts;
+	unsigned long prev_rx_npkts;
+	unsigned long prev_tx_npkts;
+};
+
+#define MAX_SOCKS 4
+static int num_socks;
+struct xdpsock *xsks[MAX_SOCKS];
+
+static unsigned long get_nsecs(void)
+{
+	struct timespec ts;
+
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return ts.tv_sec * 1000000000UL + ts.tv_nsec;
+}
+
+static void dump_stats(void);
+
+#define lassert(expr)							\
+	do {								\
+		if (!(expr)) {						\
+			fprintf(stderr, "%s:%s:%i: Assertion failed: "	\
+				#expr ": errno: %d/\"%s\"\n",		\
+				__FILE__, __func__, __LINE__,		\
+				errno, strerror(errno));		\
+			dump_stats();					\
+			exit(EXIT_FAILURE);				\
+		}							\
+	} while (0)
+
+#define barrier() __asm__ __volatile__("": : :"memory")
+#define u_smp_rmb() barrier()
+#define u_smp_wmb() barrier()
+#define likely(x) __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+
+static const char pkt_data[] =
+	"\x3c\xfd\xfe\x9e\x7f\x71\xec\xb1\xd7\x98\x3a\xc0\x08\x00\x45\x00"
+	"\x00\x2e\x00\x00\x00\x00\x40\x11\x88\x97\x05\x08\x07\x08\xc8\x14"
+	"\x1e\x04\x10\x92\x10\x92\x00\x1a\x6d\xa3\x34\x33\x1f\x69\x40\x6b"
+	"\x54\x59\xb6\x14\x2d\x11\x44\xbf\xaf\xd9\xbe\xaa";
+
+static inline u32 umem_nb_free(struct xdp_umem_uqueue *q, u32 nb)
+{
+	u32 free_entries = q->size - (q->cached_prod - q->cached_cons);
+
+	if (free_entries >= nb)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_cons = q->ring->ptrs.consumer;
+
+	return q->size - (q->cached_prod - q->cached_cons);
+}
+
+static inline u32 xq_nb_free(struct xdp_uqueue *q, u32 ndescs)
+{
+	u32 free_entries = q->cached_cons - q->cached_prod;
+
+	if (free_entries >= ndescs)
+		return free_entries;
+
+	/* Refresh the local tail pointer */
+	q->cached_cons = q->ring->ptrs.consumer + q->size;
+	return q->cached_cons - q->cached_prod;
+}
+
+static inline u32 umem_nb_avail(struct xdp_umem_uqueue *q, u32 nb)
+{
+	u32 entries = q->cached_prod - q->cached_cons;
+
+	if (entries == 0) {
+		q->cached_prod = q->ring->ptrs.producer;
+		entries = q->cached_prod - q->cached_cons;
+	}
+
+	return (entries > nb) ? nb : entries;
+}
+
+static inline u32 xq_nb_avail(struct xdp_uqueue *q, u32 ndescs)
+{
+	u32 entries = q->cached_prod - q->cached_cons;
+
+	if (entries == 0) {
+		q->cached_prod = q->ring->ptrs.producer;
+		entries = q->cached_prod - q->cached_cons;
+	}
+
+	return (entries > ndescs) ? ndescs : entries;
+}
+
+static inline int umem_fill_to_kernel_ex(struct xdp_umem_uqueue *fq,
+					 struct xdp_desc *d,
+					 size_t nb)
+{
+	u32 i;
+
+	if (umem_nb_free(fq, nb) < nb)
+		return -ENOSPC;
+
+	for (i = 0; i < nb; i++) {
+		u32 idx = fq->cached_prod++ & fq->mask;
+
+		fq->ring->desc[idx] = d[i].idx;
+	}
+
+	u_smp_wmb();
+
+	fq->ring->ptrs.producer = fq->cached_prod;
+
+	return 0;
+}
+
+static inline int umem_fill_to_kernel(struct xdp_umem_uqueue *fq, u32 *d,
+				      size_t nb)
+{
+	u32 i;
+
+	if (umem_nb_free(fq, nb) < nb)
+		return -ENOSPC;
+
+	for (i = 0; i < nb; i++) {
+		u32 idx = fq->cached_prod++ & fq->mask;
+
+		fq->ring->desc[idx] = d[i];
+	}
+
+	u_smp_wmb();
+
+	fq->ring->ptrs.producer = fq->cached_prod;
+
+	return 0;
+}
+
+static inline size_t umem_complete_from_kernel(struct xdp_umem_uqueue *cq,
+					       u32 *d, size_t nb)
+{
+	u32 idx, i, entries = umem_nb_avail(cq, nb);
+
+	u_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = cq->cached_cons++ & cq->mask;
+		d[i] = cq->ring->desc[idx];
+	}
+
+	if (entries > 0) {
+		u_smp_wmb();
+
+		cq->ring->ptrs.consumer = cq->cached_cons;
+	}
+
+	return entries;
+}
+
+static inline void *xq_get_data(struct xdpsock *xsk, __u32 idx, __u32 off)
+{
+	lassert(idx < NUM_FRAMES);
+	return &xsk->umem->frames[idx][off];
+}
+
+static inline int xq_enq(struct xdp_uqueue *uq,
+			 const struct xdp_desc *descs,
+			 unsigned int ndescs)
+{
+	struct xdp_rxtx_ring *r = uq->ring;
+	unsigned int i;
+
+	if (xq_nb_free(uq, ndescs) < ndescs)
+		return -ENOSPC;
+
+	for (i = 0; i < ndescs; i++) {
+		u32 idx = uq->cached_prod++ & uq->mask;
+
+		r->desc[idx].idx = descs[i].idx;
+		r->desc[idx].len = descs[i].len;
+		r->desc[idx].offset = descs[i].offset;
+	}
+
+	u_smp_wmb();
+
+	r->ptrs.producer = uq->cached_prod;
+	return 0;
+}
+
+static inline int xq_enq_tx_only(struct xdp_uqueue *uq,
+				 __u32 idx, unsigned int ndescs)
+{
+	struct xdp_rxtx_ring *q = uq->ring;
+	unsigned int i;
+
+	if (xq_nb_free(uq, ndescs) < ndescs)
+		return -ENOSPC;
+
+	for (i = 0; i < ndescs; i++) {
+		u32 idx = uq->cached_prod++ & uq->mask;
+
+		q->desc[idx].idx	= idx + i;
+		q->desc[idx].len	= sizeof(pkt_data) - 1;
+		q->desc[idx].offset	= 0;
+	}
+
+	u_smp_wmb();
+
+	q->ptrs.producer = uq->cached_prod;
+	return 0;
+}
+
+static inline int xq_deq(struct xdp_uqueue *uq,
+			 struct xdp_desc *descs,
+			 int ndescs)
+{
+	struct xdp_rxtx_ring *r = uq->ring;
+	unsigned int idx;
+	int i, entries;
+
+	entries = xq_nb_avail(uq, ndescs);
+
+	u_smp_rmb();
+
+	for (i = 0; i < entries; i++) {
+		idx = uq->cached_cons++ & uq->mask;
+		descs[i] = r->desc[idx];
+	}
+
+	if (entries > 0) {
+		u_smp_wmb();
+
+		r->ptrs.consumer = uq->cached_cons;
+	}
+
+	return entries;
+}
+
+static void swap_mac_addresses(void *data)
+{
+	struct ether_header *eth = (struct ether_header *)data;
+	struct ether_addr *src_addr = (struct ether_addr *)&eth->ether_shost;
+	struct ether_addr *dst_addr = (struct ether_addr *)&eth->ether_dhost;
+	struct ether_addr tmp;
+
+	tmp = *src_addr;
+	*src_addr = *dst_addr;
+	*dst_addr = tmp;
+}
+
+#if DEBUG_HEXDUMP
+static void hex_dump(void *pkt, size_t length, const char *prefix)
+{
+	int i = 0;
+	const unsigned char *address = (unsigned char *)pkt;
+	const unsigned char *line = address;
+	size_t line_size = 32;
+	unsigned char c;
+
+	printf("length = %zu\n", length);
+	printf("%s | ", prefix);
+	while (length-- > 0) {
+		printf("%02X ", *address++);
+		if (!(++i % line_size) || (length == 0 && i % line_size)) {
+			if (length == 0) {
+				while (i++ % line_size)
+					printf("__ ");
+			}
+			printf(" | ");	/* right close */
+			while (line < address) {
+				c = *line++;
+				printf("%c", (c < 33 || c == 255) ? 0x2E : c);
+			}
+			printf("\n");
+			if (length > 0)
+				printf("%s | ", prefix);
+		}
+	}
+	printf("\n");
+}
+#endif
+
+static size_t gen_eth_frame(char *frame)
+{
+	memcpy(frame, pkt_data, sizeof(pkt_data) - 1);
+	return sizeof(pkt_data) - 1;
+}
+
+static struct xdp_umem *xdp_umem_configure(int sfd)
+{
+	int fq_size = FQ_NUM_DESCS, cq_size = CQ_NUM_DESCS;
+	struct xdp_umem_reg mr;
+	struct xdp_umem *umem;
+	void *bufs;
+
+	umem = calloc(1, sizeof(*umem));
+	lassert(umem);
+
+	lassert(posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
+			       NUM_FRAMES * FRAME_SIZE) == 0);
+
+	mr.addr = (__u64)bufs;
+	mr.len = NUM_FRAMES * FRAME_SIZE;
+	mr.frame_size = FRAME_SIZE;
+	mr.frame_headroom = FRAME_HEADROOM;
+
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_FILL_RING, &fq_size,
+			   sizeof(int)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_UMEM_COMPLETION_RING, &cq_size,
+			   sizeof(int)) == 0);
+
+	umem->fq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
+			     FQ_NUM_DESCS * sizeof(u32),
+			     PROT_READ | PROT_WRITE,
+			     MAP_SHARED | MAP_POPULATE, sfd,
+			     XDP_UMEM_PGOFF_FILL_RING);
+	lassert(umem->fq.ring != MAP_FAILED);
+
+	umem->fq.mask = FQ_NUM_DESCS - 1;
+	umem->fq.size = FQ_NUM_DESCS;
+
+	umem->cq.ring = mmap(0, sizeof(struct xdp_umem_ring) +
+			     CQ_NUM_DESCS * sizeof(u32),
+			     PROT_READ | PROT_WRITE,
+			     MAP_SHARED | MAP_POPULATE, sfd,
+			     XDP_UMEM_PGOFF_COMPLETION_RING);
+	lassert(umem->cq.ring != MAP_FAILED);
+
+	umem->cq.mask = CQ_NUM_DESCS - 1;
+	umem->cq.size = CQ_NUM_DESCS;
+
+	umem->frames = (char (*)[FRAME_SIZE])bufs;
+	umem->fd = sfd;
+
+	if (opt_bench == BENCH_TXONLY) {
+		int i;
+
+		for (i = 0; i < NUM_FRAMES; i++)
+			(void)gen_eth_frame(&umem->frames[i][0]);
+	}
+
+	return umem;
+}
+
+static struct xdpsock *xsk_configure(struct xdp_umem *umem)
+{
+	struct sockaddr_xdp sxdp = {};
+	int sfd, ndescs = NUM_DESCS;
+	struct xdpsock *xsk;
+	bool shared = true;
+	u32 i;
+
+	sfd = socket(PF_XDP, SOCK_RAW, 0);
+	lassert(sfd >= 0);
+
+	xsk = calloc(1, sizeof(*xsk));
+	lassert(xsk);
+
+	xsk->sfd = sfd;
+	xsk->outstanding_tx = 0;
+
+	if (!umem) {
+		shared = false;
+		xsk->umem = xdp_umem_configure(sfd);
+	} else {
+		xsk->umem = umem;
+	}
+
+	lassert(setsockopt(sfd, SOL_XDP, XDP_RX_RING,
+			   &ndescs, sizeof(int)) == 0);
+	lassert(setsockopt(sfd, SOL_XDP, XDP_TX_RING,
+			   &ndescs, sizeof(int)) == 0);
+
+	/* Rx */
+	xsk->rx.ring = mmap(NULL,
+			    sizeof(struct xdp_ring) +
+			    NUM_DESCS * sizeof(struct xdp_desc),
+			    PROT_READ | PROT_WRITE,
+			    MAP_SHARED | MAP_POPULATE, sfd,
+			    XDP_PGOFF_RX_RING);
+	lassert(xsk->rx.ring != MAP_FAILED);
+
+	if (!shared) {
+		for (i = 0; i < NUM_DESCS / 2; i++)
+			lassert(umem_fill_to_kernel(&xsk->umem->fq, &i, 1)
+				== 0);
+	}
+
+	/* Tx */
+	xsk->tx.ring = mmap(NULL,
+			 sizeof(struct xdp_ring) +
+			 NUM_DESCS * sizeof(struct xdp_desc),
+			 PROT_READ | PROT_WRITE,
+			 MAP_SHARED | MAP_POPULATE, sfd,
+			 XDP_PGOFF_TX_RING);
+	lassert(xsk->tx.ring != MAP_FAILED);
+
+	xsk->rx.mask = NUM_DESCS - 1;
+	xsk->rx.size = NUM_DESCS;
+
+	xsk->tx.mask = NUM_DESCS - 1;
+	xsk->tx.size = NUM_DESCS;
+
+	sxdp.sxdp_family = PF_XDP;
+	sxdp.sxdp_ifindex = opt_ifindex;
+	sxdp.sxdp_queue_id = opt_queue;
+	if (shared) {
+		sxdp.sxdp_flags = XDP_SHARED_UMEM;
+		sxdp.sxdp_shared_umem_fd = umem->fd;
+	}
+
+	lassert(bind(sfd, (struct sockaddr *)&sxdp, sizeof(sxdp)) == 0);
+
+	return xsk;
+}
+
+static void print_benchmark(bool running)
+{
+	const char *bench_str = "INVALID";
+
+	if (opt_bench == BENCH_RXDROP)
+		bench_str = "rxdrop";
+	else if (opt_bench == BENCH_TXONLY)
+		bench_str = "txonly";
+	else if (opt_bench == BENCH_L2FWD)
+		bench_str = "l2fwd";
+
+	printf("%s:%d %s ", opt_if, opt_queue, bench_str);
+	if (opt_xdp_flags & XDP_FLAGS_SKB_MODE)
+		printf("xdp-skb ");
+	else if (opt_xdp_flags & XDP_FLAGS_DRV_MODE)
+		printf("xdp-drv ");
+	else
+		printf("	");
+
+	if (opt_poll)
+		printf("poll() ");
+
+	if (running) {
+		printf("running...");
+		fflush(stdout);
+	}
+}
+
+static void dump_stats(void)
+{
+	unsigned long now = get_nsecs();
+	long dt = now - prev_time;
+	int i;
+
+	prev_time = now;
+
+	for (i = 0; i < num_socks; i++) {
+		char *fmt = "%-15s %'-11.0f %'-11lu\n";
+		double rx_pps, tx_pps;
+
+		rx_pps = (xsks[i]->rx_npkts - xsks[i]->prev_rx_npkts) *
+			 1000000000. / dt;
+		tx_pps = (xsks[i]->tx_npkts - xsks[i]->prev_tx_npkts) *
+			 1000000000. / dt;
+
+		printf("\n sock%d@", i);
+		print_benchmark(false);
+		printf("\n");
+
+		printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
+		       dt / 1000000000.);
+		printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
+		printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
+
+		xsks[i]->prev_rx_npkts = xsks[i]->rx_npkts;
+		xsks[i]->prev_tx_npkts = xsks[i]->tx_npkts;
+	}
+}
+
+static void *poller(void *arg)
+{
+	(void)arg;
+	for (;;) {
+		sleep(opt_interval);
+		dump_stats();
+	}
+
+	return NULL;
+}
+
+static void int_exit(int sig)
+{
+	(void)sig;
+	dump_stats();
+	bpf_set_link_xdp_fd(opt_ifindex, -1, opt_xdp_flags);
+	exit(EXIT_SUCCESS);
+}
+
+static struct option long_options[] = {
+	{"rxdrop", no_argument, 0, 'r'},
+	{"txonly", no_argument, 0, 't'},
+	{"l2fwd", no_argument, 0, 'l'},
+	{"interface", required_argument, 0, 'i'},
+	{"queue", required_argument, 0, 'q'},
+	{"poll", no_argument, 0, 'p'},
+	{"shared-buffer", no_argument, 0, 's'},
+	{"xdp-skb", no_argument, 0, 'S'},
+	{"xdp-native", no_argument, 0, 'N'},
+	{"interval", required_argument, 0, 'n'},
+	{0, 0, 0, 0}
+};
+
+static void usage(const char *prog)
+{
+	const char *str =
+		"  Usage: %s [OPTIONS]\n"
+		"  Options:\n"
+		"  -r, --rxdrop		Discard all incoming packets (default)\n"
+		"  -t, --txonly		Only send packets\n"
+		"  -l, --l2fwd		MAC swap L2 forwarding\n"
+		"  -i, --interface=n	Run on interface n\n"
+		"  -q, --queue=n	Use queue n (default 0)\n"
+		"  -p, --poll		Use poll syscall\n"
+		"  -s, --shared-buffer	Use shared packet buffer\n"
+		"  -S, --xdp-skb=n	Use XDP skb-mod\n"
+		"  -N, --xdp-native=n	Enfore XDP native mode\n"
+		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
+		"\n";
+	fprintf(stderr, str, prog);
+	exit(EXIT_FAILURE);
+}
+
+static void parse_command_line(int argc, char **argv)
+{
+	int option_index, c;
+
+	opterr = 0;
+
+	for (;;) {
+		c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
+				&option_index);
+		if (c == -1)
+			break;
+
+		switch (c) {
+		case 'r':
+			opt_bench = BENCH_RXDROP;
+			break;
+		case 't':
+			opt_bench = BENCH_TXONLY;
+			break;
+		case 'l':
+			opt_bench = BENCH_L2FWD;
+			break;
+		case 'i':
+			opt_if = optarg;
+			break;
+		case 'q':
+			opt_queue = atoi(optarg);
+			break;
+		case 's':
+			opt_shared_packet_buffer = 1;
+			break;
+		case 'p':
+			opt_poll = 1;
+			break;
+		case 'S':
+			opt_xdp_flags |= XDP_FLAGS_SKB_MODE;
+			break;
+		case 'N':
+			opt_xdp_flags |= XDP_FLAGS_DRV_MODE;
+			break;
+		case 'n':
+			opt_interval = atoi(optarg);
+			break;
+		default:
+			usage(basename(argv[0]));
+		}
+	}
+
+	opt_ifindex = if_nametoindex(opt_if);
+	if (!opt_ifindex) {
+		fprintf(stderr, "ERROR: interface \"%s\" does not exist\n",
+			opt_if);
+		usage(basename(argv[0]));
+	}
+}
+
+static void kick_tx(int fd)
+{
+	int ret;
+
+	ret = sendto(fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (ret >= 0 || errno == ENOBUFS || errno == EAGAIN)
+		return;
+	lassert(0);
+}
+
+static inline void complete_tx_l2fwd(struct xdpsock *xsk)
+{
+	u32 descs[BATCH_SIZE];
+	unsigned int rcvd;
+	size_t ndescs;
+
+	if (!xsk->outstanding_tx)
+		return;
+
+	kick_tx(xsk->sfd);
+	ndescs = (xsk->outstanding_tx > BATCH_SIZE) ? BATCH_SIZE :
+		 xsk->outstanding_tx;
+
+	/* re-add completed Tx buffers */
+	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, ndescs);
+	if (rcvd > 0) {
+		umem_fill_to_kernel(&xsk->umem->fq, descs, rcvd);
+		xsk->outstanding_tx -= rcvd;
+		xsk->tx_npkts += rcvd;
+	}
+}
+
+static inline void complete_tx_only(struct xdpsock *xsk)
+{
+	u32 descs[BATCH_SIZE];
+	unsigned int rcvd;
+
+	if (!xsk->outstanding_tx)
+		return;
+
+	kick_tx(xsk->sfd);
+
+	rcvd = umem_complete_from_kernel(&xsk->umem->cq, descs, BATCH_SIZE);
+	if (rcvd > 0) {
+		xsk->outstanding_tx -= rcvd;
+		xsk->tx_npkts += rcvd;
+	}
+}
+
+static void rx_drop(struct xdpsock *xsk)
+{
+	struct xdp_desc descs[BATCH_SIZE];
+	unsigned int rcvd, i;
+
+	rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
+	if (!rcvd)
+		return;
+
+	for (i = 0; i < rcvd; i++) {
+		u32 idx = descs[i].idx;
+
+		lassert(idx < NUM_FRAMES);
+#if DEBUG_HEXDUMP
+		char *pkt;
+		char buf[32];
+
+		pkt = xq_get_data(xsk, idx, descs[i].offset);
+		sprintf(buf, "idx=%d", idx);
+		hex_dump(pkt, descs[i].len, buf);
+#endif
+	}
+
+	xsk->rx_npkts += rcvd;
+
+	umem_fill_to_kernel_ex(&xsk->umem->fq, descs, rcvd);
+}
+
+static void rx_drop_all(void)
+{
+	struct pollfd fds[MAX_SOCKS + 1];
+	int i, ret, timeout, nfds = 1;
+
+	memset(fds, 0, sizeof(fds));
+
+	for (i = 0; i < num_socks; i++) {
+		fds[i].fd = xsks[i]->sfd;
+		fds[i].events = POLLIN;
+		timeout = 1000; /* 1sn */
+	}
+
+	for (;;) {
+		if (opt_poll) {
+			ret = poll(fds, nfds, timeout);
+			if (ret <= 0)
+				continue;
+		}
+
+		for (i = 0; i < num_socks; i++)
+			rx_drop(xsks[i]);
+	}
+}
+
+static void tx_only(struct xdpsock *xsk)
+{
+	int timeout, ret, nfds = 1;
+	struct pollfd fds[nfds + 1];
+	unsigned int idx = 0;
+
+	memset(fds, 0, sizeof(fds));
+	fds[0].fd = xsk->sfd;
+	fds[0].events = POLLOUT;
+	timeout = 1000; /* 1sn */
+
+	for (;;) {
+		if (opt_poll) {
+			ret = poll(fds, nfds, timeout);
+			if (ret <= 0)
+				continue;
+
+			if (fds[0].fd != xsk->sfd ||
+			    !(fds[0].revents & POLLOUT))
+				continue;
+		}
+
+		if (xq_nb_free(&xsk->tx, BATCH_SIZE) >= BATCH_SIZE) {
+			lassert(xq_enq_tx_only(&xsk->tx, idx, BATCH_SIZE) == 0);
+
+			xsk->outstanding_tx += BATCH_SIZE;
+			idx += BATCH_SIZE;
+			idx %= NUM_FRAMES;
+		}
+
+		complete_tx_only(xsk);
+	}
+}
+
+static void l2fwd(struct xdpsock *xsk)
+{
+	for (;;) {
+		struct xdp_desc descs[BATCH_SIZE];
+		unsigned int rcvd, i;
+		int ret;
+
+		for (;;) {
+			complete_tx_l2fwd(xsk);
+
+			rcvd = xq_deq(&xsk->rx, descs, BATCH_SIZE);
+			if (rcvd > 0)
+				break;
+		}
+
+		for (i = 0; i < rcvd; i++) {
+			char *pkt = xq_get_data(xsk, descs[i].idx,
+						descs[i].offset);
+
+			swap_mac_addresses(pkt);
+#if DEBUG_HEXDUMP
+			char buf[32];
+			u32 idx = descs[i].idx;
+
+			sprintf(buf, "idx=%d", idx);
+			hex_dump(pkt, descs[i].len, buf);
+#endif
+		}
+
+		xsk->rx_npkts += rcvd;
+
+		ret = xq_enq(&xsk->tx, descs, rcvd);
+		lassert(ret == 0);
+		xsk->outstanding_tx += rcvd;
+	}
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	char xdp_filename[256];
+	int i, ret, key = 0;
+	pthread_t pt;
+
+	parse_command_line(argc, argv);
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		fprintf(stderr, "ERROR: setrlimit(RLIMIT_MEMLOCK) \"%s\"\n",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	snprintf(xdp_filename, sizeof(xdp_filename), "%s_kern.o", argv[0]);
+
+	if (load_bpf_file(xdp_filename)) {
+		fprintf(stderr, "ERROR: load_bpf_file %s\n", bpf_log_buf);
+		exit(EXIT_FAILURE);
+	}
+
+	if (!prog_fd[0]) {
+		fprintf(stderr, "ERROR: load_bpf_file: \"%s\"\n",
+			strerror(errno));
+		exit(EXIT_FAILURE);
+	}
+
+	if (bpf_set_link_xdp_fd(opt_ifindex, prog_fd[0], opt_xdp_flags) < 0) {
+		fprintf(stderr, "ERROR: link set xdp fd failed\n");
+		exit(EXIT_FAILURE);
+	}
+
+	ret = bpf_map_update_elem(map_fd[0], &key, &opt_queue, 0);
+	if (ret) {
+		fprintf(stderr, "ERROR: bpf_map_update_elem qidconf\n");
+		exit(EXIT_FAILURE);
+	}
+
+	/* Create sockets... */
+	xsks[num_socks++] = xsk_configure(NULL);
+
+#if RR_LB
+	for (i = 0; i < MAX_SOCKS - 1; i++)
+		xsks[num_socks++] = xsk_configure(xsks[0]->umem);
+#endif
+
+	/* ...and insert them into the map. */
+	for (i = 0; i < num_socks; i++) {
+		key = i;
+		ret = bpf_map_update_elem(map_fd[1], &key, &xsks[i]->sfd, 0);
+		if (ret) {
+			fprintf(stderr, "ERROR: bpf_map_update_elem %d\n", i);
+			exit(EXIT_FAILURE);
+		}
+	}
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+	signal(SIGABRT, int_exit);
+
+	setlocale(LC_ALL, "");
+
+	ret = pthread_create(&pt, NULL, poller, NULL);
+	lassert(ret == 0);
+
+	prev_time = get_nsecs();
+
+	if (opt_bench == BENCH_RXDROP)
+		rx_drop_all();
+	else if (opt_bench == BENCH_TXONLY)
+		tx_only(xsks[0]);
+	else
+		l2fwd(xsks[0]);
+
+	return 0;
+}
-- 
2.14.1

^ permalink raw reply related

* Re: [PATCH bpf-next v2 00/15] Introducing AF_XDP support
From: Björn Töpel @ 2018-04-27 12:21 UTC (permalink / raw)
  To: Bjorn Topel, Karlsson, Magnus, Duyck, Alexander H,
	Alexander Duyck, John Fastabend, Alexei Starovoitov,
	Jesper Dangaard Brouer, Willem de Bruijn, Daniel Borkmann,
	Michael S. Tsirkin, Netdev
  Cc: Björn Töpel, michael.lundkvist, Brandeburg, Jesse,
	Singhai, Anjali, Zhang, Qi Z
In-Reply-To: <20180427121728.18512-1-bjorn.topel@gmail.com>

2018-04-27 14:17 GMT+02:00 Björn Töpel <bjorn.topel@gmail.com>:
> From: Björn Töpel <bjorn.topel@intel.com>
>
> This patch set introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This patch set only supports copy-mode
> for the generic XDP path (XDP_SKB) for both RX and TX and copy-mode
> for RX using the XDP_DRV path. Zero-copy support requires XDP and
> driver changes that Jesper Dangaard Brouer is working on. Some of his
> work has already been accepted. We will publish our zero-copy support
> for RX and TX on top of his patch sets at a later point in time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is call a UMEM. It consists of a
> number of equally size frames and each frame has a unique frame id. A
> descriptor in one of the queues references a frame by referencing its
> frame id. The user space allocates memory for this UMEM using whatever
> means it feels is most appropriate (malloc, mmap, huge pages,
> etc). This memory area is then registered with the kernel using the new
> setsockopt XDP_UMEM_REG. The UMEM also has two queues: the FILL queue
> and the COMPLETION queue. The fill queue is used by the application to
> send down frame ids for the kernel to fill in with RX packet
> data. References to these frames will then appear in the RX queue of
> the XSK once they have been received. The completion queue, on the
> other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>
> The socket is then finally bound with a bind() call to a device and a
> specific queue id on that device, and it is not until bind is
> completed that traffic starts to flow. Note that in this patch set,
> all packet data is copied out to user-space.
>
> A new feature in this patch set is that the UMEM can be shared between
> processes, if desired. If a process wants to do this, it simply skips
> the registration of the UMEM and its corresponding two queues, sets a
> flag in the bind call and submits the XSK of the process it would like
> to share UMEM with as well as its own newly created XSK socket. The
> new process will then receive frame id references in its own RX queue
> that point to this shared UMEM. Note that since the queue structures
> are single-consumer / single-producer (for performance reasons), the
> new process has to create its own socket with associated RX and TX
> queues, since it cannot share this with the other process. This is
> also the reason that there is only one set of FILL and COMPLETION
> queues per UMEM. It is the responsibility of a single process to
> handle the UMEM. If multiple-producer / multiple-consumer queues are
> implemented in the future, this requirement could be relaxed.
>
> How is then packets distributed between these two XSK? We have
> introduced a new BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in
> full). The user-space application can place an XSK at an arbitrary
> place in this map. The XDP program can then redirect a packet to a
> specific index in this map and at this point XDP validates that the
> XSK in that map was indeed bound to that device and queue number. If
> not, the packet is dropped. If the map is empty at that index, the
> packet is also dropped. This also means that it is currently mandatory
> to have an XDP program loaded (and one XSK in the XSKMAP) to be able
> to get any traffic to user space through the XSK.
>
> AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the
> driver does not have support for XDP, or XDP_SKB is explicitly chosen
> when loading the XDP program, XDP_SKB mode is employed that uses SKBs
> together with the generic XDP support and copies out the data to user
> space. A fallback mode that works for any network device. On the other
> hand, if the driver has support for XDP, it will be used by the AF_XDP
> code to provide better performance, but there is still a copy of the
> data into user space.
>
> There is a xdpsock benchmarking/test application included that
> demonstrates how to use AF_XDP sockets with both private and shared
> UMEMs. Say that you would like your UDP traffic from port 4242 to end
> up in queue 16, that we will enable AF_XDP on. Here, we use ethtool
> for this:
>
>       ethtool -N p3p2 rx-flow-hash udp4 fn
>       ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \
>           action 16
>
> Running the rxdrop benchmark in XDP_DRV mode can then be done
> using:
>
>       samples/bpf/xdpsock -i p3p2 -q 16 -r -N
>
> For XDP_SKB mode, use the switch "-S" instead of "-N" and all options
> can be displayed with "-h", as usual.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for TR/RX and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc version 5.4.0 20160609. The NIC is an
> Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by commercial packet generator HW that is
> generating packets at full 40 Gbit/s line rate.
>
> AF_XDP performance 64 byte packets. Results from V1 in parenthesis.
> Benchmark   XDP_SKB   XDP_DRV
> rxdrop       3.0(2.9)   9.5(9.4)
> txpush       2.5(2.5)   NA*
> l2fwd        1.9(1.9)   2.5(2.4) (TX using XDP_SKB in both cases)
>
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV
> rxdrop       2.2(2.1)   3.3(3.3)
> l2fwd        1.4(1.4)   1.8(1.8) (TX using XDP_SKB in both cases)
>
> * NA since we have no support for TX using the XDP_DRV infrastructure
>   in this patch set. This is for a future patch set since it involves
>   changes to the XDP NDOs. Some of this has been upstreamed by Jesper
>   Dangaard Brouer.
>
> XDP performance on our system as a base line:
>
> 64 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      32,921,521  0
>
> 1500 byte packets:
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      16      3,289,491   0
>
> Changes from V1:
>
> * Fixes to bugs spotted by Will in his review
> * Implemented the performance otimization to BPF_MAP_TYPE_XSKMAP
>   suggested by Will
> * Refactored packet_direct_xmit to become a common function
>   in core/dev.c as suggested by Will
> * Added documentation as suggested by Jesper
> * Proper page unpinning as suggested by MST
> * Some minor code cleanups
>
> The structure of the patch set is as follows:
>
> Patches 1-3: Basic socket and umem plumbing
> Patches 4-9: RX support together with the new XSKMAP
> Patches 10-13: TX support
> Patch 14: Statistics support with getsockopt()
> Patch 15: Sample application
>
> We based this patch set on bpf-next commit
>

Oops, I pressed play on tape too soon. We based it on commit
79741a38b4a2 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next").


Björn

> To do for this patch set:
>
> * Syzkaller torture session being worked on
>
> Post-series plan:
>
> * Optimize performance
>
> * Kernel selftest
>
> * Kernel load module support of AF_XDP would be nice. Unclear how to
>   achieve this though since our XDP code depends on net/core.
>
> * Support for AF_XDP sockets without an XPD program loaded. In this
>   case all the traffic on a queue should go up to the user space socket.
>
> * Daniel Borkmann's suggestion for a "copy to XDP socket, and return
>   XDP_PASS" for a tcpdump-like functionality.
>
> * And of course getting to zero-copy support in small increments.
>
> Thanks: Björn and Magnus
>
> Björn Töpel (7):
>   net: initial AF_XDP skeleton
>   xsk: add user memory registration support sockopt
>   xsk: add Rx queue setup and mmap support
>   xsk: add Rx receive functions and poll support
>   bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP
>   xsk: wire up XDP_DRV side of AF_XDP
>   xsk: wire up XDP_SKB side of AF_XDP
>
> Magnus Karlsson (8):
>   xsk: add umem fill queue support and mmap
>   xsk: add support for bind for Rx
>   xsk: add umem completion queue support and mmap
>   xsk: add Tx queue setup and mmap support
>   dev: packet: make packet_direct_xmit a common function
>   xsk: support for Tx
>   xsk: statistics support
>   samples/bpf: sample application and documentation for AF_XDP sockets
>
>  Documentation/networking/af_xdp.rst | 297 +++++++++++
>  Documentation/networking/index.rst  |   1 +
>  MAINTAINERS                         |   8 +
>  include/linux/bpf.h                 |  26 +
>  include/linux/bpf_types.h           |   3 +
>  include/linux/filter.h              |   2 +-
>  include/linux/netdevice.h           |   1 +
>  include/linux/socket.h              |   5 +-
>  include/net/xdp.h                   |   1 +
>  include/net/xdp_sock.h              |  66 +++
>  include/uapi/linux/bpf.h            |   1 +
>  include/uapi/linux/if_xdp.h         |  87 ++++
>  kernel/bpf/Makefile                 |   3 +
>  kernel/bpf/verifier.c               |   8 +-
>  kernel/bpf/xskmap.c                 | 272 +++++++++++
>  net/Kconfig                         |   1 +
>  net/Makefile                        |   1 +
>  net/core/dev.c                      |  73 ++-
>  net/core/filter.c                   |  40 +-
>  net/core/sock.c                     |  12 +-
>  net/core/xdp.c                      |  15 +-
>  net/packet/af_packet.c              |  42 +-
>  net/xdp/Kconfig                     |   7 +
>  net/xdp/Makefile                    |   2 +
>  net/xdp/xdp_umem.c                  | 260 ++++++++++
>  net/xdp/xdp_umem.h                  |  67 +++
>  net/xdp/xdp_umem_props.h            |  23 +
>  net/xdp/xsk.c                       | 656 +++++++++++++++++++++++++
>  net/xdp/xsk_queue.c                 |  73 +++
>  net/xdp/xsk_queue.h                 | 247 ++++++++++
>  samples/bpf/Makefile                |   4 +
>  samples/bpf/xdpsock.h               |  11 +
>  samples/bpf/xdpsock_kern.c          |  56 +++
>  samples/bpf/xdpsock_user.c          | 948 ++++++++++++++++++++++++++++++++++++
>  security/selinux/hooks.c            |   4 +-
>  security/selinux/include/classmap.h |   4 +-
>  36 files changed, 3255 insertions(+), 72 deletions(-)
>  create mode 100644 Documentation/networking/af_xdp.rst
>  create mode 100644 include/net/xdp_sock.h
>  create mode 100644 include/uapi/linux/if_xdp.h
>  create mode 100644 kernel/bpf/xskmap.c
>  create mode 100644 net/xdp/Kconfig
>  create mode 100644 net/xdp/Makefile
>  create mode 100644 net/xdp/xdp_umem.c
>  create mode 100644 net/xdp/xdp_umem.h
>  create mode 100644 net/xdp/xdp_umem_props.h
>  create mode 100644 net/xdp/xsk.c
>  create mode 100644 net/xdp/xsk_queue.c
>  create mode 100644 net/xdp/xsk_queue.h
>  create mode 100644 samples/bpf/xdpsock.h
>  create mode 100644 samples/bpf/xdpsock_kern.c
>  create mode 100644 samples/bpf/xdpsock_user.c
>
> --
> 2.14.1
>

^ permalink raw reply

* Re: [PATCH net] pppoe: check sockaddr length in pppoe_connect()
From: Kevin Easton @ 2018-04-27 12:23 UTC (permalink / raw)
  To: Guillaume Nault; +Cc: netdev, Michal Ostrowski
In-Reply-To: <387ca48810af36f2626049008a795d1adc375cb8.1524494257.git.g.nault@alphalink.fr>

On Mon, Apr 23, 2018 at 04:38:27PM +0200, Guillaume Nault wrote:
> We must validate sockaddr_len, otherwise userspace can pass fewer data
> than we expect and we end up accessing invalid data.
> 
> Fixes: 224cf5ad14c0 ("ppp: Move the PPP drivers")
> Reported-by: syzbot+4f03bdf92fdf9ef5ddab@syzkaller.appspotmail.com
> Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
> ---
>  drivers/net/ppp/pppoe.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
> index 1483bc7b01e1..7df07337d69c 100644
> --- a/drivers/net/ppp/pppoe.c
> +++ b/drivers/net/ppp/pppoe.c
> @@ -620,6 +620,10 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
>  	lock_sock(sk);
>  
>  	error = -EINVAL;
> +
> +	if (sockaddr_len != sizeof(struct sockaddr_pppox))
> +		goto end;
> +
>  	if (sp->sa_protocol != PX_PROTO_OE)
>  		goto end;

There's another bug here - pppoe_connect() should also be validating
sp->sa_family.  My suggested patch was going to be:

diff --git a/drivers/net/ppp/pppoe.c b/drivers/net/ppp/pppoe.c
index 1483bc7..90eb3fd 100644
--- a/drivers/net/ppp/pppoe.c
+++ b/drivers/net/ppp/pppoe.c
@@ -620,6 +620,14 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
        lock_sock(sk);
 
        error = -EINVAL;
+       if (sockaddr_len < sizeof(struct sockaddr_pppox))
+               goto end;
+
+       error = -EAFNOSUPPORT;
+       if (sp->sa_family != AF_PPPOX)
+               goto end;
+
+       error = -EINVAL;
        if (sp->sa_protocol != PX_PROTO_OE)
                goto end;
 
Should I rework this on top of net.git HEAD?

(The same applies to pppol2tp_connect()).

    - Kevin

^ permalink raw reply related

* Re: [PATCH net-next v8 2/4] net: Introduce generic failover module
From: kbuild test robot @ 2018-04-27 12:39 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: kbuild-all, mst, stephen, davem, netdev, virtualization,
	virtio-dev, jesse.brandeburg, alexander.h.duyck, kubakici,
	sridhar.samudrala, jasowang, loseweigh, jiri, aaron.f.brown
In-Reply-To: <1524700768-38627-3-git-send-email-sridhar.samudrala@intel.com>

Hi Sridhar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Sridhar-Samudrala/Enable-virtio_net-to-act-as-a-standby-for-a-passthru-device/20180427-183842
reproduce:
        # apt-get install sparse
        make ARCH=x86_64 allmodconfig
        make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> net/core/net_failover.c:544:39: sparse: incorrect type in argument 1 (different address spaces) @@    expected struct net_device *dev @@    got struct net_devicestruct net_device *dev @@
   net/core/net_failover.c:544:39:    expected struct net_device *dev
   net/core/net_failover.c:544:39:    got struct net_device [noderef] <asn:4>*standby_dev
   net/core/net_failover.c:547:39: sparse: incorrect type in argument 1 (different address spaces) @@    expected struct net_device *dev @@    got struct net_devicestruct net_device *dev @@
   net/core/net_failover.c:547:39:    expected struct net_device *dev
   net/core/net_failover.c:547:39:    got struct net_device [noderef] <asn:4>*primary_dev
>> net/core/net_failover.c:112:12: sparse: context imbalance in 'net_failover_select_queue' - wrong count at exit

vim +544 net/core/net_failover.c

   446	
   447	static int net_failover_slave_register(struct net_device *slave_dev)
   448	{
   449		struct net_failover_info *nfo_info;
   450		struct net_failover_ops *nfo_ops;
   451		struct net_device *failover_dev;
   452		bool slave_is_standby;
   453		u32 orig_mtu;
   454		int err;
   455	
   456		ASSERT_RTNL();
   457	
   458		failover_dev = net_failover_get_bymac(slave_dev->perm_addr, &nfo_ops);
   459		if (!failover_dev)
   460			goto done;
   461	
   462		if (failover_dev->type != slave_dev->type)
   463			goto done;
   464	
   465		if (nfo_ops && nfo_ops->slave_register)
   466			return nfo_ops->slave_register(slave_dev, failover_dev);
   467	
   468		nfo_info = netdev_priv(failover_dev);
   469		slave_is_standby = (slave_dev->dev.parent == failover_dev->dev.parent);
   470		if (slave_is_standby ? rtnl_dereference(nfo_info->standby_dev) :
   471				rtnl_dereference(nfo_info->primary_dev)) {
   472			netdev_err(failover_dev, "%s attempting to register as slave dev when %s already present\n",
   473				   slave_dev->name,
   474				   slave_is_standby ? "standby" : "primary");
   475			goto done;
   476		}
   477	
   478		/* We want to allow only a direct attached VF device as a primary
   479		 * netdev. As there is no easy way to check for a VF device, restrict
   480		 * this to a pci device.
   481		 */
   482		if (!slave_is_standby && (!slave_dev->dev.parent ||
   483					  !dev_is_pci(slave_dev->dev.parent)))
   484			goto done;
   485	
   486		if (failover_dev->features & NETIF_F_VLAN_CHALLENGED &&
   487		    vlan_uses_dev(failover_dev)) {
   488			netdev_err(failover_dev, "Device %s is VLAN challenged and failover device has VLAN set up\n",
   489				   failover_dev->name);
   490			goto done;
   491		}
   492	
   493		/* Align MTU of slave with failover dev */
   494		orig_mtu = slave_dev->mtu;
   495		err = dev_set_mtu(slave_dev, failover_dev->mtu);
   496		if (err) {
   497			netdev_err(failover_dev, "unable to change mtu of %s to %u register failed\n",
   498				   slave_dev->name, failover_dev->mtu);
   499			goto done;
   500		}
   501	
   502		dev_hold(slave_dev);
   503	
   504		if (netif_running(failover_dev)) {
   505			err = dev_open(slave_dev);
   506			if (err && (err != -EBUSY)) {
   507				netdev_err(failover_dev, "Opening slave %s failed err:%d\n",
   508					   slave_dev->name, err);
   509				goto err_dev_open;
   510			}
   511		}
   512	
   513		netif_addr_lock_bh(failover_dev);
   514		dev_uc_sync_multiple(slave_dev, failover_dev);
   515		dev_uc_sync_multiple(slave_dev, failover_dev);
   516		netif_addr_unlock_bh(failover_dev);
   517	
   518		err = vlan_vids_add_by_dev(slave_dev, failover_dev);
   519		if (err) {
   520			netdev_err(failover_dev, "Failed to add vlan ids to device %s err:%d\n",
   521				   slave_dev->name, err);
   522			goto err_vlan_add;
   523		}
   524	
   525		err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
   526						 failover_dev);
   527		if (err) {
   528			netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
   529				   err);
   530			goto err_handler_register;
   531		}
   532	
   533		err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
   534		if (err) {
   535			netdev_err(slave_dev, "can not set failover device %s (err = %d)\n",
   536				   failover_dev->name, err);
   537			goto err_upper_link;
   538		}
   539	
   540		slave_dev->priv_flags |= IFF_FAILOVER_SLAVE;
   541	
   542		if (slave_is_standby) {
   543			rcu_assign_pointer(nfo_info->standby_dev, slave_dev);
 > 544			dev_get_stats(nfo_info->standby_dev, &nfo_info->standby_stats);
   545		} else {
   546			rcu_assign_pointer(nfo_info->primary_dev, slave_dev);
   547			dev_get_stats(nfo_info->primary_dev, &nfo_info->primary_stats);
   548			failover_dev->min_mtu = slave_dev->min_mtu;
   549			failover_dev->max_mtu = slave_dev->max_mtu;
   550		}
   551	
   552		net_failover_compute_features(failover_dev);
   553	
   554		call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
   555	
   556		netdev_info(failover_dev, "failover %s slave:%s registered\n",
   557			    slave_is_standby ? "standby" : "primary", slave_dev->name);
   558	
   559		goto done;
   560	
   561	err_upper_link:
   562		netdev_rx_handler_unregister(slave_dev);
   563	err_handler_register:
   564		vlan_vids_del_by_dev(slave_dev, failover_dev);
   565	err_vlan_add:
   566		dev_uc_unsync(slave_dev, failover_dev);
   567		dev_mc_unsync(slave_dev, failover_dev);
   568		dev_close(slave_dev);
   569	err_dev_open:
   570		dev_put(slave_dev);
   571		dev_set_mtu(slave_dev, orig_mtu);
   572	done:
   573		return NOTIFY_DONE;
   574	}
   575	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

^ permalink raw reply

* Hello from Lisa
From: Lisa Johnson @ 2018-04-27 12:44 UTC (permalink / raw)


Hello dear,
I am Miss Lisa. I have very important thing to discuss with you
please, this information is very vital. Contact me with my private
email so we can talk (lisajohnsonsalimanto@hotmail.com )
Lisa.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox