Netdev List
 help / color / mirror / Atom feed
* Re: build failure of next-20220811 due to 332f1795ca20 ("Bluetooth: L2CAP: Fix l2cap_global_chan_by_psm regression")
From: Sudip Mukherjee @ 2022-08-12  4:09 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Luiz Augusto von Dentz, Johan Hedberg, Marcel Holtmann,
	David S. Miller, Eric Dumazet, Paolo Abeni, linux-kernel, netdev,
	linux-bluetooth, linux-next, Thomas Bogendoerfer, linux-mips
In-Reply-To: <20220811124637.4cdb84f1@kernel.org>

On Thu, Aug 11, 2022 at 8:46 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 11 Aug 2022 19:53:04 +0100 Sudip Mukherjee (Codethink) wrote:
> > Not sure if it has been reported, builds of csky and mips allmodconfig
> > failed to build next-20220811 with gcc-12.
>
> I can't repro with the cross compiler from kernel.org.
> Can you test something like this?

With this patch I get new failure:

In file included from net/bluetooth/l2cap_core.c:37:
./include/net/bluetooth/bluetooth.h: In function 'ba_is_any':
./include/net/bluetooth/bluetooth.h:346:16: error: returning 'void *'
from a function with return type 'int' makes integer from pointer
without a cast [-Werror=int-conversion]
  346 |         return memchr_inv(ba, sizeof(*ba), 0);

So for a quick test, I modified it a little (just a typecast) which worked.

diff --git a/include/net/bluetooth/bluetooth.h
b/include/net/bluetooth/bluetooth.h
index e72f3b247b5e..19bdd2520070 100644
--- a/include/net/bluetooth/bluetooth.h
+++ b/include/net/bluetooth/bluetooth.h
@@ -341,6 +341,11 @@ static inline bool bdaddr_type_is_le(u8 type)
 #define BDADDR_ANY  (&(bdaddr_t) {{0, 0, 0, 0, 0, 0}})
 #define BDADDR_NONE (&(bdaddr_t) {{0xff, 0xff, 0xff, 0xff, 0xff, 0xff}})

+static inline int ba_is_any(const bdaddr_t *ba)
+{
+       return (int) memchr_inv(ba, sizeof(*ba), 0);
+}
+
 /* Copy, swap, convert BD Address */
 static inline int bacmp(const bdaddr_t *ba1, const bdaddr_t *ba2)
 {
diff --git a/net/bluetooth/l2cap_core.c b/net/bluetooth/l2cap_core.c
index cbe0cae73434..67c5d923bc6c 100644
--- a/net/bluetooth/l2cap_core.c
+++ b/net/bluetooth/l2cap_core.c
@@ -2000,8 +2000,8 @@ static struct l2cap_chan
*l2cap_global_chan_by_psm(int state, __le16 psm,
                        }

                        /* Closest match */
-                       src_any = !bacmp(&c->src, BDADDR_ANY);
-                       dst_any = !bacmp(&c->dst, BDADDR_ANY);
+                       src_any = !ba_is_any(&c->src);
+                       dst_any = !ba_is_any(&c->dst);
                        if ((src_match && dst_any) || (src_any && dst_match) ||
                            (src_any && dst_any))
                                c1 = c;



-- 
Regards
Sudip

^ permalink raw reply related

* Re: Kernel Panic in skb_release_data using genet
From: Florian Fainelli @ 2022-08-12  3:33 UTC (permalink / raw)
  To: Maxime Ripard, Florian Fainelli
  Cc: nicolas saenz julienne, Doug Berger, bcm-kernel-feedback-list,
	linux-kernel, netdev
In-Reply-To: <20220517075254.5sctk4hgomjvnuxg@houat>



On 5/17/2022 12:52 AM, Maxime Ripard wrote:
> It's not really 100% reliable, but happens 30%-50% of the time at boot
> when KASAN is enabled. It seems like enabling KASAN increases that
> likelihood though, it went unnoticed for some time before I started
> having those issues again when I enabled it for something unrelated.
> 
> It looks like it happens in bursts though, so I would get 10-15 boots
> fine, and then 4-5 boots with that crash.
> 
> Cold boot vs reboot doesn't seem to affect it in one way or the other.
> 
>> What version of GCC did you build your kernel with?
> 
> The arm64 cross-compiler packaged by Fedora, which is GCC 11.2
> at the moment.
> 
>> How often does that happen? What config.txt file are you using
>> for your Pi4 B?
> 
> You'll find my config.txt and kernel .config attached

OK, so this is what I have been able to reproduce so far but this does 
not appear to be very reliable to reproduce, I will try my best to hold 
on to that lead though, thanks for your patience.

# udhcpc -i eth0
udhcpc: started, v1.35.0
[   34.355086] bcmgenet fd580000.ethernet: configuring instance for 
external RGMII (RX delay)
[   34.363758] 
==================================================================
[   34.371106] BUG: KASAN: user-memory-access in put_page+0x10/0x64
[   34.377227] Read of size 4 at addr 01000085 by task ifconfig/165
[   34.383338]
[   34.384857] CPU: 0 PID: 165 Comm: ifconfig Tainted: G        W 
   5.19.0 #43
[   34.392560] Hardware name: BCM2711
[   34.396020]  unwind_backtrace from show_stack+0x18/0x1c
[   34.401354]  show_stack from dump_stack_lvl+0x40/0x4c
[   34.406502]  dump_stack_lvl from kasan_report+0x8c/0xa4
[   34.411825]  kasan_report from put_page+0x10/0x64
[   34.416615]  put_page from skb_release_data+0x84/0x13c
[   34.421847]  skb_release_data from __kfree_skb+0x14/0x20
[   34.427256]  __kfree_skb from bcmgenet_rx_poll+0x504/0x6f8
[   34.432846]  bcmgenet_rx_poll from __napi_poll.constprop.0+0x50/0x1c0
[   34.439407]  __napi_poll.constprop.0 from net_rx_action+0x278/0x488
[   34.445787]  net_rx_action from __do_softirq+0x268/0x390
[   34.451197]  __do_softirq from __irq_exit_rcu+0x88/0xf8
[   34.456521]  __irq_exit_rcu from irq_exit+0x10/0x18
[   34.461492]  irq_exit from call_with_stack+0x18/0x20
[   34.466553]  call_with_stack from __irq_svc+0x84/0x94
[   34.471696] Exception stack(0xf0d337f8 to 0xf0d33840)
[   34.476835] 37e0: 
   c5548580 00000003
[   34.485156] 3800: 00002000 f0a40808 c5548000 c5548580 00000000 
c554b000 c5548580 c554bdd0
[   34.493474] 3820: 00000000 00000004 c5548580 f0d33848 c094329c 
c09432bc 00070013 ffffffff
[   34.501788]  __irq_svc from bcmgenet_open+0xe1c/0x1094
[   34.507023]  bcmgenet_open from __dev_open+0x1e4/0x21c
[   34.512258]  __dev_open from __dev_change_flags+0x228/0x25c
[   34.517931]  __dev_change_flags from dev_change_flags+0x48/0x88
[   34.523958]  dev_change_flags from devinet_ioctl+0x3ac/0x834
[   34.529723]  devinet_ioctl from inet_ioctl+0x250/0x2a4
[   34.534956]  inet_ioctl from sock_ioctl+0x1dc/0x410
[   34.539927]  sock_ioctl from vfs_ioctl+0x50/0x64
[   34.544632]  vfs_ioctl from sys_ioctl+0x134/0xa7c
[   34.549422]  sys_ioctl from ret_fast_syscall+0x0/0x4c
[   34.554565] Exception stack(0xf0d33fa8 to 0xf0d33ff0)
[   34.559705] 3fa0:                   0051fd98 0053f9dc 00000003 
00008914 b6dc5c4c b6dc5bd0
[   34.568025] 3fc0: 0051fd98 0053f9dc b6dc5f55 00000036 b6dc5e48 
00000003 aed11d00 aed12010
[   34.576341] 3fe0: 00000036 b6dc5bb8 aec4c2f3 aebdda66
[   34.581475] 
==================================================================
[   34.588882] Disabling lock debugging due to kernel taint
[   34.594288] 8<--- cut here ---
[   34.597412] Unable to handle kernel paging request at virtual address 
01000085
[   34.604775] [01000085] *pgd=01982003, *pmd=00000000
[   34.609751] Internal error: Oops: 206 [#1] SMP ARM
[   34.614624] Modules linked in:
[   34.617734] CPU: 0 PID: 165 Comm: ifconfig Tainted: G    B   W 
   5.19.0 #43
[   34.625435] Hardware name: BCM2711
[   34.628892] PC is at put_page+0x14/0x64
[   34.632800] LR is at kasan_report+0x98/0xa4
[   34.637056] pc : [<c0b4bee4>]    lr : [<c047ea5c>]    psr: 60070113
[   34.643427] sp : f0803d50  ip : 00000000  fp : c554bfd8
[   34.648739] r10: 00007f5e  r9 : c694f582  r8 : c1fef15e
[   34.654052] r7 : c694f5b8  r6 : c694f580  r5 : 01000081  r4 : c1fef100
[   34.660689] r3 : 00000000  r2 : c1f047c0  r1 : 00000004  r0 : 00000001
[   34.667325] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM 
Segment user
[   34.674582] Control: 30c5383d  Table: 0606b700  DAC: fffffffd
[   34.680422] Register r0 information: non-paged memory
[   34.685565] Register r1 information: non-paged memory
[   34.690705] Register r2 information: slab task_struct start c1f047c0 
pointer offset 0
[   34.698690] Register r3 information: NULL pointer
[   34.703477] Register r4 information: slab skbuff_head_cache start 
c1fef100 pointer offset 0 size 48
[   34.712699] Register r5 information: non-paged memory
[   34.717839] Register r6 information: non-slab/vmalloc memory
[   34.723595] Register r7 information: non-slab/vmalloc memory
[   34.729352] Register r8 information: slab skbuff_head_cache start 
c1fef100 pointer offset 94 size 48
[   34.738662] Register r9 information: non-slab/vmalloc memory
[   34.744419] Register r10 information: non-paged memory
[   34.749646] Register r11 information: non-slab/vmalloc memory
[   34.755492] Register r12 information: NULL pointer
[   34.760366] Process ifconfig (pid: 165, stack limit = 0xf517d551)
[   34.766573] Stack: (0xf0803d50 to 0xf0804000)
[   34.771005] 3d40:                                     c1fef100 
00000001 c694f580 c0b4dc74
[   34.779325] 3d60: c1fef100 c5548000 c5548580 c1fef100 f0803e40 
7f5e0001 00007f5e c0b4db24
[   34.787644] 3d80: c554bdd0 c0940f84 0bc80000 b4c23195 c2cb12c0 
c0efdab0 c2cb12c0 00000001
[   34.795963] 3da0: 00000000 00000040 00000004 c554bec4 1e1007bc 
c554beb8 c5548588 00000004
[   34.804282] 3dc0: c55498bc c554bec8 c02d5684 00000003 00000000 
c02b6e10 e7df0980 c02bf390
[   34.812601] 3de0: 41b58ab3 c15fec7a c0940a80 c1f047c0 00070113 
257ac000 e7de97cc ffff982d
[   34.820919] 3e00: 00000000 00000000 00000000 00000000 00000000 
00000000 00000000 b4c23195
[   34.829237] 3e20: c1f047c0 e7de8680 00000000 c1f047c0 00000000 
c076733c e7de9ad8 00000000
[   34.837556] 3e40: e7de97d4 c613e0a0 00000001 c554bdd0 00000001 
00000040 f0803ef0 c554bdd8
[   34.845875] 3e60: 257ac000 c2805d40 e7df0d00 c0b70f24 c554bdd0 
f0803ef0 00000000 e7df0b40
[   34.854195] 3e80: f0803f60 bd1007d8 c554bdd0 c2644b40 257ac000 
c0b7130c 0000012c e7df0d0c
[   34.862513] 3ea0: ffff9839 f0803ef0 81d99054 c554bdd4 0000002c 
257ac000 c26433c8 c0840554
[   34.870832] 3ec0: 41b58ab3 c1612850 c0b71094 c2cb12c0 e7df0980 
c02d8a5c ea8ed400 c02d8ae0
[   34.879150] 3ee0: 41b58ab3 c15f3580 c08403c4 00000010 c554bd00 
c554bdd8 00000000 00000010
[   34.887470] 3f00: f0803f00 f0803f00 c5548580 00002000 c554bdd0 
c554b580 0000010a c093e0b8
[   34.895788] 3f20: f0803f20 f0803f20 0000002c c093df98 c2806f18 
c029f4ac 00000000 00000007
[   34.904108] 3f40: e7de9780 c02a4218 00000104 c4dca800 00000001 
c4dca824 c4dca86c c4dca86c
[   34.912427] 3f60: c4dca848 f0803fc8 f0d337f0 b4c23195 c4dca800 
c1f047c0 c280508c 00000008
[   34.920747] 3f80: c2643dc0 c1f047c4 00000003 00000100 c1f049d4 
c02014d8 c4dca800 c1f047c0
[   34.929066] 3fa0: 00400100 0000000a ffff9838 00000004 c263c3c8 
257ac000 c26433c0 c1f047c0
[   34.937385] 3fc0: c2643dc0 c1f047c4 257ac000 257ac000 c1f047c0 
00000000 f0d337f0 c02312c4
[   34.945704] 3fe0: c09432bc 00070013 ffffffff f0d3382c c5548580 
c0231418 c09432bc c07559fc
[   34.954019]  put_page from skb_release_data+0x84/0x13c
[   34.959252]  skb_release_data from __kfree_skb+0x14/0x20
[   34.964660]  __kfree_skb from bcmgenet_rx_poll+0x504/0x6f8
[   34.970250]  bcmgenet_rx_poll from __napi_poll.constprop.0+0x50/0x1c0
[   34.976812]  __napi_poll.constprop.0 from net_rx_action+0x278/0x488
[   34.983192]  net_rx_action from __do_softirq+0x268/0x390
[   34.988602]  __do_softirq from __irq_exit_rcu+0x88/0xf8
[   34.993927]  __irq_exit_rcu from irq_exit+0x10/0x18
[   34.998899]  irq_exit from call_with_stack+0x18/0x20
[   35.003958]  call_with_stack from __irq_svc+0x84/0x94
[   35.009101] Exception stack(0xf0d337f8 to 0xf0d33840)
[   35.014238] 37e0: 
   c5548580 00000003
[   35.022557] 3800: 00002000 f0a40808 c5548000 c5548580 00000000 
c554b000 c5548580 c554bdd0
[   35.030877] 3820: 00000000 00000004 c5548580 f0d33848 c094329c 
c09432bc 00070013 ffffffff
[   35.039192]  __irq_svc from bcmgenet_open+0xe1c/0x1094
[   35.044427]  bcmgenet_open from __dev_open+0x1e4/0x21c
[   35.049661]  __dev_open from __dev_change_flags+0x228/0x25c
[   35.055334]  __dev_change_flags from dev_change_flags+0x48/0x88
[   35.061361]  dev_change_flags from devinet_ioctl+0x3ac/0x834
[   35.067125]  devinet_ioctl from inet_ioctl+0x250/0x2a4
[   35.072359]  inet_ioctl from sock_ioctl+0x1dc/0x410
[   35.077330]  sock_ioctl from vfs_ioctl+0x50/0x64
[   35.082034]  vfs_ioctl from sys_ioctl+0x134/0xa7c
[   35.086825]  sys_ioctl from ret_fast_syscall+0x0/0x4c
[   35.091969] Exception stack(0xf0d33fa8 to 0xf0d33ff0)
[   35.097109] 3fa0:                   0051fd98 0053f9dc 00000003 
00008914 b6dc5c4c b6dc5bd0
[   35.105428] 3fc0: 0051fd98 0053f9dc b6dc5f55 00000036 b6dc5e48 
00000003 aed11d00 aed12010
[   35.113744] 3fe0: 00000036 b6dc5bb8 aec4c2f3 aebdda66
[   35.118883] Code: e1a05000 e2800004 ebe4cca7 e3a01004 (e5953004)
[   35.125104] ---[ end trace 0000000000000000 ]---
[   35.129801] Kernel panic - not syncing: Fatal exception in interrupt
[   35.136260] CPU3: stopping
[   35.139009] CPU: 3 PID: 27 Comm: migration/3 Tainted: G    B D W 
     5.19.0 #43
[   35.146872] Hardware name: BCM2711
[   35.150318] Stopper: multi_cpu_stop+0x0/0x140 <- 
stop_machine_cpuslocked+0x180/0x1e4
[   35.158197]  unwind_backtrace from show_stack+0x18/0x1c
[   35.163509]  show_stack from dump_stack_lvl+0x40/0x4c
[   35.168643]  dump_stack_lvl from do_handle_IPI+0x150/0x2a8
[   35.174218]  do_handle_IPI from ipi_handler+0x1c/0x28
[   35.179351]  ipi_handler from handle_percpu_devid_irq+0x94/0x150
[   35.185454]  handle_percpu_devid_irq from handle_irq_desc+0x38/0x48
[   35.191820]  handle_irq_desc from gic_handle_irq+0x6c/0x78
[   35.197393]  gic_handle_irq from generic_handle_arch_irq+0x28/0x3c
[   35.203671]  generic_handle_arch_irq from call_with_stack+0x18/0x20
[   35.210038]  call_with_stack from __irq_svc+0x84/0x94
[   35.215168] Exception stack(0xf0913e98 to 0xf0913ee0)
[   35.220293] 3e80: 
   e7e20a10 00000000
[   35.228594] 3ea0: 00000000 257dc000 e7e1ec68 f0913ee8 257dc000 
00000000 c2806f18 60070013
[   35.236896] 3ec0: f0863d70 f0863d74 f0863d70 f0913ee8 c02bebd4 
c02bebe8 60070013 ffffffff
[   35.245192]  __irq_svc from rcu_momentary_dyntick_idle+0x2c/0x9c
[   35.251296]  rcu_momentary_dyntick_idle from multi_cpu_stop+0xd4/0x140
[   35.257931]  multi_cpu_stop from cpu_stopper_thread+0x120/0x1d8
[   35.263947]  cpu_stopper_thread from smpboot_thread_fn+0x25c/0x264
[   35.270228]  smpboot_thread_fn from kthread+0x12c/0x140
[   35.275539]  kthread from ret_from_fork+0x14/0x1c
[   35.280317] Exception stack(0xf0913fb0 to 0xf0913ff8)
[   35.285441] 3fa0:                                     00000000 
00000000 00000000 00000000
[   35.293739] 3fc0: 00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000
[   35.302037] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[   35.308746] CPU2: stopping
[   35.311492] CPU: 2 PID: 22 Comm: migration/2 Tainted: G    B D W 
     5.19.0 #43
[   35.319355] Hardware name: BCM2711
[   35.322803] Stopper: multi_cpu_stop+0x0/0x140 <- 
stop_machine_cpuslocked+0x180/0x1e4
[   35.330677]  unwind_backtrace from show_stack+0x18/0x1c
[   35.335988]  show_stack from dump_stack_lvl+0x40/0x4c
[   35.341122]  dump_stack_lvl from do_handle_IPI+0x150/0x2a8
[   35.346697]  do_handle_IPI from ipi_handler+0x1c/0x28
[   35.351830]  ipi_handler from handle_percpu_devid_irq+0x94/0x150
[   35.357932]  handle_percpu_devid_irq from handle_irq_desc+0x38/0x48
[   35.364298]  handle_irq_desc from gic_handle_irq+0x6c/0x78
[   35.369870]  gic_handle_irq from generic_handle_arch_irq+0x28/0x3c
[   35.376148]  generic_handle_arch_irq from call_with_stack+0x18/0x20
[   35.382515]  call_with_stack from __irq_svc+0x84/0x94
[   35.387646] Exception stack(0xf08ebea8 to 0xf08ebef0)
[   35.392773] bea0:                   f0863d70 00000003 00000000 
00000001 f0863d60 00000000
[   35.401074] bec0: 00000001 00000000 c2806f18 600c0013 f0863d70 
f0863d74 f0863d70 f08ebef8
[   35.409372] bee0: c030acac c02bebbc 600c0013 ffffffff
[   35.414495]  __irq_svc from rcu_momentary_dyntick_idle+0x0/0x9c
[   35.420511]  rcu_momentary_dyntick_idle from 0xc31d0000
[   35.425820] CPU1: stopping
[   35.428568] CPU: 1 PID: 17 Comm: migration/1 Tainted: G    B D W 
     5.19.0 #43
[   35.436430] Hardware name: BCM2711
[   35.439879] Stopper: multi_cpu_stop+0x0/0x140 <- 
stop_machine_cpuslocked+0x180/0x1e4
[   35.447752]  unwind_backtrace from show_stack+0x18/0x1c
[   35.453064]  show_stack from dump_stack_lvl+0x40/0x4c
[   35.458198]  dump_stack_lvl from do_handle_IPI+0x150/0x2a8
[   35.463772]  do_handle_IPI from ipi_handler+0x1c/0x28
[   35.468905]  ipi_handler from handle_percpu_devid_irq+0x94/0x150
[   35.475006]  handle_percpu_devid_irq from handle_irq_desc+0x38/0x48
[   35.481373]  handle_irq_desc from gic_handle_irq+0x6c/0x78
[   35.486945]  gic_handle_irq from generic_handle_arch_irq+0x28/0x3c
[   35.493222]  generic_handle_arch_irq from call_with_stack+0x18/0x20
[   35.499590]  call_with_stack from __irq_svc+0x84/0x94
[   35.504721] Exception stack(0xf08c3e98 to 0xf08c3ee0)
[   35.509847] 3e80: 
   e7e00a10 00000000
[   35.518148] 3ea0: 00000000 257bc000 e7dfec68 f08c3ee8 257bc000 
00000000 c2806f18 600f0013
[   35.526449] 3ec0: f0863d70 f0863d74 f0863d70 f08c3ee8 c02bebd4 
c02bebe8 600f0013 ffffffff
[   35.534745]  __irq_svc from rcu_momentary_dyntick_idle+0x2c/0x9c
[   35.540849]  rcu_momentary_dyntick_idle from multi_cpu_stop+0xd4/0x140
[   35.547483]  multi_cpu_stop from cpu_stopper_thread+0x120/0x1d8
[   35.553499]  cpu_stopper_thread from smpboot_thread_fn+0x25c/0x264
[   35.559780]  smpboot_thread_fn from kthread+0x12c/0x140
[   35.565090]  kthread from ret_from_fork+0x14/0x1c
[   35.569868] Exception stack(0xf08c3fb0 to 0xf08c3ff8)
[   35.574992] 3fa0:                                     00000000 
00000000 00000000 00000000
[   35.583292] 3fc0: 00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000
[   35.591589] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[   35.599291] ---[ end Kernel panic - not syncing: Fatal exception in 
interrupt ]---
-- 
Florian

^ permalink raw reply

* Re: [PATCH] bonding: return -ENOMEM on rlb_initialize() allocation failure
From: Jay Vosburgh @ 2022-08-12  3:29 UTC (permalink / raw)
  To: Jiapeng Chong
  Cc: vfalico, andy, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel, Abaci Robot
In-Reply-To: <20220812032059.64572-1-jiapeng.chong@linux.alibaba.com>

Jiapeng Chong <jiapeng.chong@linux.alibaba.com> wrote:

>drivers/net/bonding/bond_alb.c:861 rlb_initialize() warn: returning -1 instead of -ENOMEM is sloppy.

	I'll disagree; the return value is only ever tested for being
non-zero.

	-J

>Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1896
>Reported-by: Abaci Robot <abaci@linux.alibaba.com>
>Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
>---
> drivers/net/bonding/bond_alb.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index 60cb9a0225aa..96cb4404b3c7 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -858,7 +858,7 @@ static int rlb_initialize(struct bonding *bond)
> 
> 	new_hashtbl = kmalloc(size, GFP_KERNEL);
> 	if (!new_hashtbl)
>-		return -1;
>+		return -ENOMEM;
> 
> 	spin_lock_bh(&bond->mode_lock);
> 
>-- 
>2.20.1.7.g153144c
>

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply

* [PATCH] bonding: return -ENOMEM on rlb_initialize() allocation failure
From: Jiapeng Chong @ 2022-08-12  3:20 UTC (permalink / raw)
  To: j.vosburgh
  Cc: vfalico, andy, davem, edumazet, kuba, pabeni, netdev,
	linux-kernel, Jiapeng Chong, Abaci Robot

drivers/net/bonding/bond_alb.c:861 rlb_initialize() warn: returning -1 instead of -ENOMEM is sloppy.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=1896
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
---
 drivers/net/bonding/bond_alb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index 60cb9a0225aa..96cb4404b3c7 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -858,7 +858,7 @@ static int rlb_initialize(struct bonding *bond)
 
 	new_hashtbl = kmalloc(size, GFP_KERNEL);
 	if (!new_hashtbl)
-		return -1;
+		return -ENOMEM;
 
 	spin_lock_bh(&bond->mode_lock);
 
-- 
2.20.1.7.g153144c


^ permalink raw reply related

* [PATCH] net: lan966x: fix checking for return value of platform_get_irq_byname()
From: Li Qiong @ 2022-08-12  3:09 UTC (permalink / raw)
  To: Horatiu Vultur, UNGLinuxDriver, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
  Cc: netdev, linux-kernel, yuzhe, renyu, jiaming, kernel-janitors,
	Li Qiong

The platform_get_irq_byname() returns non-zero IRQ number
or negative error number. "if (irq)" always true, chang it
to "if (irq > 0)"

Signed-off-by: Li Qiong <liqiong@nfschina.com>
---
 drivers/net/ethernet/microchip/lan966x/lan966x_main.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/microchip/lan966x/lan966x_main.c b/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
index 1d6e3b641b2e..d928b75f3780 100644
--- a/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
+++ b/drivers/net/ethernet/microchip/lan966x/lan966x_main.c
@@ -710,7 +710,7 @@ static void lan966x_cleanup_ports(struct lan966x *lan966x)
 	disable_irq(lan966x->xtr_irq);
 	lan966x->xtr_irq = -ENXIO;
 
-	if (lan966x->ana_irq) {
+	if (lan966x->ana_irq > 0) {
 		disable_irq(lan966x->ana_irq);
 		lan966x->ana_irq = -ENXIO;
 	}
@@ -718,10 +718,10 @@ static void lan966x_cleanup_ports(struct lan966x *lan966x)
 	if (lan966x->fdma)
 		devm_free_irq(lan966x->dev, lan966x->fdma_irq, lan966x);
 
-	if (lan966x->ptp_irq)
+	if (lan966x->ptp_irq > 0)
 		devm_free_irq(lan966x->dev, lan966x->ptp_irq, lan966x);
 
-	if (lan966x->ptp_ext_irq)
+	if (lan966x->ptp_ext_irq > 0)
 		devm_free_irq(lan966x->dev, lan966x->ptp_ext_irq, lan966x);
 }
 
@@ -1049,7 +1049,7 @@ static int lan966x_probe(struct platform_device *pdev)
 	}
 
 	lan966x->ana_irq = platform_get_irq_byname(pdev, "ana");
-	if (lan966x->ana_irq) {
+	if (lan966x->ana_irq > 0) {
 		err = devm_request_threaded_irq(&pdev->dev, lan966x->ana_irq, NULL,
 						lan966x_ana_irq_handler, IRQF_ONESHOT,
 						"ana irq", lan966x);
-- 
2.11.0


^ permalink raw reply related

* [PATCH net-next v2] net: skb: prevent the split of kfree_skb_reason() by gcc
From: menglong8.dong @ 2022-08-12  2:50 UTC (permalink / raw)
  To: kuba, miguel.ojeda.sandonis
  Cc: ojeda, ndesaulniers, davem, edumazet, pabeni, asml.silence,
	imagedong, luiz.von.dentz, vasily.averin, jk, linux-kernel,
	netdev

From: Menglong Dong <imagedong@tencent.com>

Sometimes, gcc will optimize the function by spliting it to two or
more functions. In this case, kfree_skb_reason() is splited to
kfree_skb_reason and kfree_skb_reason.part.0. However, the
function/tracepoint trace_kfree_skb() in it needs the return address
of kfree_skb_reason().

This split makes the call chains becomes:
  kfree_skb_reason() -> kfree_skb_reason.part.0 -> trace_kfree_skb()

which makes the return address that passed to trace_kfree_skb() be
kfree_skb().

Therefore, prevent this kind of optimization to kfree_skb_reason() by
making the optimize level to "O1". I think these should be better
method instead of this "O1", but I can't figure it out......

This optimization CAN happen, which depend on the behavior of gcc.
I'm not able to reproduce it in the latest kernel code, but it happens
in my kernel of version 5.4.119. Maybe the latest code already do someting
that prevent this happen?

Signed-off-by: Menglong Dong <imagedong@tencent.com>
---
v2:
- replace 'optimize' with '__optimize__' in __nofnsplit, as Miguel Ojeda
  advised.
---
 include/linux/compiler_attributes.h | 2 ++
 net/core/skbuff.c                   | 3 ++-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h
index 445e80517cab..b910b5775fc7 100644
--- a/include/linux/compiler_attributes.h
+++ b/include/linux/compiler_attributes.h
@@ -270,6 +270,8 @@
  */
 #define __noreturn                      __attribute__((__noreturn__))
 
+#define __nofnsplit                     __attribute__((__optimize__("O1")))
+
 /*
  * Optional: not supported by gcc.
  * Optional: not supported by icc.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 974bbbbe7138..ff9ccbc032b9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -777,7 +777,8 @@ EXPORT_SYMBOL(__kfree_skb);
  *	hit zero. Meanwhile, pass the drop reason to 'kfree_skb'
  *	tracepoint.
  */
-void kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
+void __nofnsplit
+kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
 {
 	if (!skb_unref(skb))
 		return;
-- 
2.36.1


^ permalink raw reply related

* [PATCH bpf-next] libbpf: making bpf_prog_load() ignore name if kernel doesn't support
From: Hangbin Liu @ 2022-08-12  2:40 UTC (permalink / raw)
  To: netdev
  Cc: Quentin Monnet, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	bpf, Hangbin Liu

Similar with commit 10b62d6a38f7 ("libbpf: Add names for auxiliary maps"),
let's make bpf_prog_load() also ignore name if kernel doesn't support
program name.

To achieve this, we need to call sys_bpf_prog_load() directly in
probe_kern_prog_name() to avoid circular dependency. sys_bpf_prog_load()
also need to be exported in the bpf.h file.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 tools/lib/bpf/bpf.c    |  6 ++----
 tools/lib/bpf/bpf.h    |  3 +++
 tools/lib/bpf/libbpf.c | 11 +++++++++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 6a96e665dc5d..575867d69496 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -84,9 +84,7 @@ static inline int sys_bpf_fd(enum bpf_cmd cmd, union bpf_attr *attr,
 	return ensure_good_fd(fd);
 }
 
-#define PROG_LOAD_ATTEMPTS 5
-
-static inline int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int attempts)
+int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int attempts)
 {
 	int fd;
 
@@ -263,7 +261,7 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
 	attr.prog_ifindex = OPTS_GET(opts, prog_ifindex, 0);
 	attr.kern_version = OPTS_GET(opts, kern_version, 0);
 
-	if (prog_name)
+	if (prog_name && kernel_supports(NULL, FEAT_PROG_NAME))
 		libbpf_strlcpy(attr.prog_name, prog_name, sizeof(attr.prog_name));
 	attr.license = ptr_to_u64(license);
 
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 9c50beabdd14..125c580e45f8 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -35,6 +35,9 @@
 extern "C" {
 #endif
 
+#define PROG_LOAD_ATTEMPTS 5
+int sys_bpf_prog_load(union bpf_attr *attr, unsigned int size, int attempts);
+
 int libbpf_set_memlock_rlim(size_t memlock_bytes);
 
 struct bpf_map_create_opts {
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 3f01f5cd8a4c..1bcb2735d3f1 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4419,10 +4419,17 @@ static int probe_kern_prog_name(void)
 		BPF_MOV64_IMM(BPF_REG_0, 0),
 		BPF_EXIT_INSN(),
 	};
-	int ret, insn_cnt = ARRAY_SIZE(insns);
+	union bpf_attr attr = {
+		.prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
+		.prog_name = "test",
+		.license = ptr_to_u64("GPL"),
+		.insns = ptr_to_u64(insns),
+		.insn_cnt = (__u32)ARRAY_SIZE(insns),
+	};
+	int ret;
 
 	/* make sure loading with name works */
-	ret = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, "test", "GPL", insns, insn_cnt, NULL);
+	ret = sys_bpf_prog_load(&attr, sizeof(attr), PROG_LOAD_ATTEMPTS);
 	return probe_fd(ret);
 }
 
-- 
2.31.1


^ permalink raw reply related

* Re: [RFCv7 PATCH net-next 01/36] net: introduce operation helpers for netdev features
From: shenjian (K) @ 2022-08-12  2:39 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: davem, kuba, andrew, ecree.xilinx, hkallweit1, saeed, leon,
	netdev, linuxarm
In-Reply-To: <20220811104936.3675-1-alexandr.lobakin@intel.com>



在 2022/8/11 18:49, Alexander Lobakin 写道:
> From: "shenjian (K)" <shenjian15@huawei.com>
> Date: Wed, 10 Aug 2022 19:32:28 +0800
>
>> 在 2022/8/10 17:43, Alexander Lobakin 写道:
>> > From: Jian Shen <shenjian15@huawei.com>
>> > Date: Wed, 10 Aug 2022 11:05:49 +0800
>> >
>> >> Introduce a set of bitmap operation helpers for netdev features,
>> >> then we can use them to replace the logical operation with them.
>> >>
>> >> The implementation of these helpers are based on the old prototype
>> >> of netdev_features_t is still u64. These helpers will be rewritten
>> >> on the last patch, when the prototype changes.
>> >>
>> >> To avoid interdependencies between netdev_features_helper.h and
>> >> netdevice.h, put the helpers for testing feature in the netdevice.h,
>> >> and move advandced helpers like netdev_get_wanted_features() and
>> >> netdev_intersect_features() to netdev_features_helper.h.
>> >>
>> >> Signed-off-by: Jian Shen <shenjian15@huawei.com>
>> >> ---
>> >>   include/linux/netdev_features.h        |  11 +
>> >>   include/linux/netdev_features_helper.h | 707 
>> +++++++++++++++++++++++++
>> > 'netdev_feature_helpers.h' fits more I guess, doesn't it? It
>> > contains several helpers, not only one.
>> ok, will rename it.
>>
>> > And BTW, do you think it's worth to create a new file rather than
>> > put everything just in netdev_features.h?
>> Jakub suggested me to move them to a new file, then it can be includued
>> at users appropriately. 
>> [https://www.spinics.net/lists/netdev/msg809370.html]
>>
>> And it's unable to put everything in netdev_features.h, because these 
>> helpers
>> need to see the definition of struct net_device which is defined in 
>> netdevice.h.
>> It leading interdependence for netdeice.h include netdev_features.h.
>
> Ah, correct then, sure! I missed that fact.
>
>>
>>
>> >>   include/linux/netdevice.h              |  45 +-
>> >>   net/8021q/vlan_dev.c                   |   1 +
>> >>   net/core/dev.c                         |   1 +
>> >>   5 files changed, 747 insertions(+), 18 deletions(-)
>> >>   create mode 100644 include/linux/netdev_features_helper.h
>> >>
>> >> diff --git a/include/linux/netdev_features.h 
>> b/include/linux/netdev_features.h
>> >> index 7c2d77d75a88..9d434b4e6e6e 100644
>> >> --- a/include/linux/netdev_features.h
>> >> +++ b/include/linux/netdev_features.h
>> >> @@ -11,6 +11,17 @@
>> >>   >>   typedef u64 netdev_features_t;
>> >>   >> +struct netdev_feature_set {
>> >> +    unsigned int cnt;
>> >> +    unsigned short feature_bits[];
>> >> +};
>> >> +
>> >> +#define DECLARE_NETDEV_FEATURE_SET(name, features...)            \
>> >> +    const struct netdev_feature_set name = {            \
>> >> +        .cnt = sizeof((unsigned short[]){ features }) / 
>> sizeof(unsigned short),    \
>> >> +        .feature_bits = { features },                \
>> >> +    }
>> >> +
>> >>   enum {
>> >>       NETIF_F_SG_BIT,            /* Scatter/gather IO. */
>> >>       NETIF_F_IP_CSUM_BIT,        /* Can checksum TCP/UDP over 
>> IPv4. */
>> >> diff --git a/include/linux/netdev_features_helper.h 
>> b/include/linux/netdev_features_helper.h
>> >> new file mode 100644
>> >> index 000000000000..5423927d139b
>> >> --- /dev/null
>> >> +++ b/include/linux/netdev_features_helper.h
>> >> @@ -0,0 +1,707 @@
>> >> +/* SPDX-License-Identifier: GPL-2.0-or-later */
>> >> +/*
>> >> + * Network device features helpers.
>> >> + */
>> >> +#ifndef _LINUX_NETDEV_FEATURES_HELPER_H
>> >> +#define _LINUX_NETDEV_FEATURES_HELPER_H
>> >> +
>> >> +#include <linux/netdevice.h>
>> >> +
>> >> +static inline void netdev_features_zero(netdev_features_t *dst)
>> >> +{
>> >> +    *dst = 0;
>> >> +}
>> >> +
>> >> +/* active_feature prefer to netdev->features */
>> >> +#define netdev_active_features_zero(ndev) \
>> >> +        netdev_features_zero(&ndev->features)
>> > netdev_features_t sometimes is being placed and used on the stack.
>> > I think it's better to pass just `netdev_features_t *` to those
>> > helpers, this way you wouldn't also need to create a new helper
>> > for each net_device::*_features.
>> My purpose of defining  helpers for each net_device::*_features is to
>> avoiding driver to change  net_device::*_features directly.
>
> But why? My point is that you have to create a whole bunch of
> copy'n'paste functions differing only by the &net_device field
> name.
>
I noticed that Jakub have done a lot work for avoiding driver to write 
netdev->dev_addr
directly. Also in earlier discuss, Saeed had suggested to hide hide the 
implementation
details and abstract it away from drivers using getters and manipulation 
APIs.
[https://lore.kernel.org/all/b335852ecaba3c86d1745b5021bb500798fc843b.camel@kernel.org/]


>>
>> >> +
>> >> +#define netdev_hw_features_zero(ndev) \
>> >> + netdev_features_zero(&ndev->hw_features)
>
> Oh BTW: wrap `ndev` in the netdev_features_zero() call into braces,
> `netdev_feature_zero(&(ndev)->hw_features)`, otherwise it may cause
> unwanted sneaky logical changes or build failures.
>
OK, will fix it.

>> >> +
>> >> +#define netdev_wanted_features_zero(ndev) \
>> > [...]
>> >
>> >> +#define netdev_gso_partial_features_and(ndev, __features) \
>> >> + netdev_features_and(ndev->gso_partial_features, __features)
>> >> +
>> >> +/* helpers for netdev features '&=' operation */
>> >> +static inline void
>> >> +netdev_features_mask(netdev_features_t *dst,
>> >> +               const netdev_features_t features)
>> >> +{
>> >> +    *dst = netdev_features_and(*dst, features);
>> > A small proposal: if you look at bitmap_and() for example, it
>> > returns 1 if the resulting bitmap is non-empty and 0 if it is. What
>> > about doing the same here? It would probably help to do reduce
>> > boilerplating in the drivers where we only want to know if there's
>> > anything left after masking.
>> > Same for xor, toggle etc.
>> Thanks for point this.  Return whether empty, then I can remove 
>> netdev_features_intersects
>> helpers. But there are also many places to use 'f1 & f2' as return 
>> value or input param, then
>> I need to define more temporay features to store the result, and then 
>> return the temporay
>> features or pass into it.
>
> No, netdev_features_intersects() is okay, leave it as it is. Just
> look on bitmap_*() prototypes and return its values when applicable.
>
OK, will follow the prototypes of bitmap_and and others.


>>
>> >> +}
>> >> +
>> >> +static inline void
>> >> +netdev_active_features_mask(struct net_device *ndev,
>> >> +                const netdev_features_t features)
>> >> +{
>> >> +    ndev->features = netdev_active_features_and(ndev, features);
>> >> +}
>> > [...]
>> >
>> >> +/* helpers for netdev features 'set bit array' operation */
>> >> +static inline void
>> >> +netdev_features_set_array(const struct netdev_feature_set *set,
>> >> +              netdev_features_t *dst)
>> >> +{
>> >> +    int i;
>> >> +
>> >> +    for (i = 0; i < set->cnt; i++)
>> > Nit: kernel is C11 now, you can do just `for (u32 i = 0; i ...`.
>> > (and yeah, it's better to use unsigned types when you don't plan
>> > to store negative values there).
>> ok, will fix it.
>>
>> >> +        netdev_feature_add(set->feature_bits[i], dst);
>> >> +}
>> > [...]
>> >
>> >> -- >> 2.33.0
>> > Thanks,
>> > Olek
>> >
>> > .
>
> Thanks,
> Olek
>
> .
>
Thanks,
Jian


^ permalink raw reply

* Re: [PATCH net-next] net: skb: prevent the split of kfree_skb_reason() by gcc
From: Menglong Dong @ 2022-08-12  2:30 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: kuba, ojeda, ndesaulniers, davem, edumazet, pabeni, asml.silence,
	imagedong, luiz.von.dentz, vasily.averin, jk, linux-kernel,
	netdev
In-Reply-To: <CANiq72noui51tmbhySEH1B=cRJm2JgNMGPboLoguZ+P53whRsA@mail.gmail.com>

On Thu, Aug 11, 2022 at 10:35 PM Miguel Ojeda
<miguel.ojeda.sandonis@gmail.com> wrote:
>
> On Thu, Aug 11, 2022 at 4:34 PM Miguel Ojeda
> <miguel.ojeda.sandonis@gmail.com> wrote:
> >
> > Two notes on this: please use the double underscore form:
> > `__optimize__` and keep the file sorted (it should go after
> > `__overloadable__`, since we sort by the actual attribute name).
>
> s/after/before
>

Okay......Thanks for your reminds :/

Menglong Dong

> Cheers,
> Miguel

^ permalink raw reply

* Re: [PATCHv3 bpf-next] libbpf: Add names for auxiliary maps
From: Hangbin Liu @ 2022-08-12  1:59 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: netdev, Quentin Monnet, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	bpf
In-Reply-To: <CAEf4Bzb+vfjrZv+3fmg8wmDQc5iBXO+xubKCdL-4BsgxGmuyOg@mail.gmail.com>

On Fri, Aug 12, 2022 at 6:15 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> Applied to bpf-next. But let's also do the same for bpf_prog_load().
> We can make probe_kern_prog_name() use sys_bpf_prog_load() directly
> and avoid calling bpf_prog_load() and thus avoiding circular
> dependency.

Ah, yes, we can do it this way. I will post the patch today.

^ permalink raw reply

* Re: [PATCH vhost 0/2] virtio_net: fix for stuck when change ring size with dev down
From: Xuan Zhuo @ 2022-08-12  1:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: virtualization, Jason Wang, David S. Miller, Eric Dumazet,
	Paolo Abeni, netdev, Michael S. Tsirkin
In-Reply-To: <20220811103730.0f085866@kernel.org>

On Thu, 11 Aug 2022 10:37:30 -0700, Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 11 Aug 2022 04:11:22 -0400 Michael S. Tsirkin wrote:
> > Which patches does this fix?
> > Maybe I should squash.
>
> Side question to make sure I understand the terminology - this
> is *not* a vhost patch, right? vhost is the host side of virtio?
> Is the work going via the vhost tree because of some dependencies?


Yes, the commits fixed by this patch are currently in Michael's vhost branch.

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/log/?h=linux-next

So I mean that by "vhost" here, not into the net/net-next branch. Or should I use
a more accurate term next time?

Thanks.

^ permalink raw reply

* Re: [RFC net-next 3/4] ynl: add a sample python library
From: Stephen Hemminger @ 2022-08-12  1:04 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, edumazet, pabeni, sdf, jacob.e.keller, vadfed,
	johannes, jiri, dsahern, fw, linux-doc
In-Reply-To: <20220811022304.583300-4-kuba@kernel.org>

On Wed, 10 Aug 2022 19:23:03 -0700
Jakub Kicinski <kuba@kernel.org> wrote:

> A very short and very incomplete generic python library.
> 
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

It would be great if python had standard module for netlink.
Then your code could just (re)use that.
Something like mnl but for python.


^ permalink raw reply

* Re: [PATCH bpf-next 13/15] mm, memcg: Add new helper get_obj_cgroup_from_cgroup
From: Yafang Shao @ 2022-08-12  0:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Muchun Song, Andrew Morton, netdev, bpf, Linux MM
In-Reply-To: <YvUrXLJF6qrGOdjP@P9FQF9L96D.corp.robot.car>

On Fri, Aug 12, 2022 at 12:16 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Wed, Aug 10, 2022 at 03:18:38PM +0000, Yafang Shao wrote:
> > Introduce new helper get_obj_cgroup_from_cgroup() to get obj_cgroup from
> > a specific cgroup.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/memcontrol.h |  1 +
> >  mm/memcontrol.c            | 41 +++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 42 insertions(+)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 2f0a611..901a921 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1713,6 +1713,7 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg,
> >  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
> >  void __memcg_kmem_uncharge_page(struct page *page, int order);
> >
> > +struct obj_cgroup *get_obj_cgroup_from_cgroup(struct cgroup *cgrp);
> >  struct obj_cgroup *get_obj_cgroup_from_current(void);
> >  struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 618c366..762cffa 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2908,6 +2908,47 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> >       return objcg;
> >  }
> >
> > +static struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> > +{
> > +     struct obj_cgroup *objcg;
> > +
> > +     if (memcg_kmem_bypass())
> > +             return NULL;
> > +
> > +     rcu_read_lock();
> > +     objcg = __get_obj_cgroup_from_memcg(memcg);
> > +     rcu_read_unlock();
> > +     return objcg;
>
> This code doesn't make sense to me. What does rcu read lock protect here?

To protect rcu_dereference(memcg->objcg);.
Doesn't it need the read rcu lock ?

-- 
Regards
Yafang

^ permalink raw reply

* Re: [PATCH bpf-next 05/15] bpf: Fix incorrect mem_cgroup_put
From: Yafang Shao @ 2022-08-12  0:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Muchun Song, Andrew Morton, netdev, bpf, Linux MM
In-Reply-To: <YvUy5IA+XJp7ylIC@P9FQF9L96D.corp.robot.car>

On Fri, Aug 12, 2022 at 12:48 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Wed, Aug 10, 2022 at 03:18:30PM +0000, Yafang Shao wrote:
> > The memcg may be the root_mem_cgroup, in which case we shouldn't put it.
> > So a new helper bpf_map_put_memcg() is introduced to pair with
> > bpf_map_get_memcg().
> >
> > Fixes: 4201d9ab3e42 ("bpf: reparent bpf maps on memcg offlining")
> > Cc: Roman Gushchin <roman.gushchin@linux.dev>
> > Cc: Shakeel Butt <shakeelb@google.com>
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  kernel/bpf/syscall.c | 14 +++++++++++---
> >  1 file changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 83c7136..51ab8b1 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -441,6 +441,14 @@ static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
> >       return root_mem_cgroup;
> >  }
> >
> > +static void bpf_map_put_memcg(struct mem_cgroup *memcg)
> > +{
> > +     if (mem_cgroup_is_root(memcg))
> > +             return;
> > +
> > +     mem_cgroup_put(memcg);
> > +}
>
> +1 to what Shakeel said. mem_cgroup_put(root_mem_cgroup) is totally valid.
> So this change does absolutely nothing.
>

Do you mean that we can mem_cgroup_put(root_mem_cgroup) without
mem_cgroup_get(root_mem_cgroup) ?
Am I missing something ?

> The fixes tag assumes there is a bug in the existing code. If so, please,
> describe the problem and how to reproduce it.
>

It is found by code review.  The root_mem_cgroup's css will break. But
I don't know what it may cause to the user.
If you think the fixes tag is unproper, I will remove it.

> Also, if it's not related to the rest of the patchset, please, send it
> separately.
>

I want to introduce a bpf_map_put_memcg() helper to pair with
bpf_map_get_memcg().
This new helper will be used by other patches.

-- 
Regards
Yafang

^ permalink raw reply

* Re: [PATCH bpf-next 05/15] bpf: Fix incorrect mem_cgroup_put
From: Yafang Shao @ 2022-08-12  0:27 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, jolsa, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, netdev, bpf, Linux MM
In-Reply-To: <20220811154731.3smhom6v4qy2u6rd@google.com>

On Thu, Aug 11, 2022 at 11:47 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Aug 11, 2022 at 10:49:13AM +0800, Yafang Shao wrote:
> > On Thu, Aug 11, 2022 at 1:07 AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Wed, Aug 10, 2022 at 03:18:30PM +0000, Yafang Shao wrote:
> > > > The memcg may be the root_mem_cgroup, in which case we shouldn't put it.
> > >
> > > No, it is ok to put root_mem_cgroup. css_put already handles the root
> > > cgroups.
> > >
> >
> > Ah, this commit log doesn't describe the issue clearly. I should improve it.
> > The issue is that in bpf_map_get_memcg() it doesn't get the objcg if
> > map->objcg is NULL (that can happen if the map belongs to the root
> > memcg), so we shouldn't put the objcg if map->objcg is NULL neither in
> > bpf_map_put_memcg().
>
> Sorry I am still not understanding. We are not 'getting' objcg in
> bpf_map_get_memcg(). We are 'getting' memcg from map->objcg and if that
> is NULL the function is returning root memcg and putting root memcg is
> totally fine.

When the map belongs to root_mem_cgroup, the map->objcg is NULL, right ?
See also bpf_map_save_memcg() and it describes clearly in the comment -

static void bpf_map_save_memcg(struct bpf_map *map)
{
        /* Currently if a map is created by a process belonging to the root
         * memory cgroup, get_obj_cgroup_from_current() will return NULL.
         * So we have to check map->objcg for being NULL each time it's
         * being used.
         */
        map->objcg = get_obj_cgroup_from_current();
}

So for the root_mem_cgroup case, bpf_map_get_memcg() will return
root_mem_cgroup directly without getting its css, right ? See below,

static struct mem_cgroup *bpf_map_get_memcg(const struct bpf_map *map)
{

        if (map->objcg)
                return get_mem_cgroup_from_objcg(map->objcg);

        return root_mem_cgroup;   // without css_get(&memcg->css);
}

But it will put the css unconditionally.  See below,

memcg = bpf_map_get_memcg(map);
...
mem_cgroup_put(memcg);

So we should put it *conditionally* as well.

  memcg = bpf_map_get_memcg(map);
   ...
+ if (map->objcg)
       mem_cgroup_put(memcg);

Is it clear to you ?

> Or are you saying that root_mem_cgroup is NULL?
>

No

-- 
Regards
Yafang

^ permalink raw reply

* Re: [RFC net-next 4/4] ynl: add a sample user for ethtool
From: Jakub Kicinski @ 2022-08-11 23:31 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, davem, edumazet, pabeni, jacob.e.keller, vadfed, johannes,
	jiri, dsahern, stephen, fw, linux-doc
In-Reply-To: <CAKH8qBs54kX_MjA2xHM1sSa_zvNYDEPhiZcwEVWV4kP1dEPcEw@mail.gmail.com>

On Thu, 11 Aug 2022 15:55:44 -0700 Stanislav Fomichev wrote:
> > We could, I guess. To be clear this controls the count, IOW:
> >
> > enum {
> >         PREFIX_A_BLA_ATTR = 1,
> >         PREFIX_A_ANOTHER_ATTR,
> >         PREFIX_A_AND_ONEMORE,
> >         __PFREIX_A_CNT, // <--- This thing
> > };
> > #define PREFIX_A_MAX (__PFREIX_A_CNT - 1)
> >
> > It's not used in the generated code, only if we codegen the uAPI,
> > AFAIR. So we'd need a way to tell the generator of the uAPI about
> > the situation, anyway. I could be misremembering.  
> 
> My worry is that we'll have more hacks like these and it's hard, as a
> spec reader/writer, to figure out that they exist..
> So I was wondering if it's "easier" (from the spec reader/writer pov)
> to have some c-header-fixup: section where we can have plain
> c-preprocessor hacks like these (where we need to redefine something
> to match the old behavior).

Let me think about it some more. My main motivation is people writing
new families, I haven't sent too much time worrying about the existing
ones with all their quirks. It's entirely possible that the uAPI quirks
can just go and we won't generate uAPI for existing families as it
doesn't buy us anything.

> Coming from stubby/grpc, I was expecting to see words like
> message/field/struct. The question is what's more confusing: sticking
> with netlink naming or trying to map grpc/thrift concepts on top of
> what we have. (I'm assuming more people know about grpc/thrift than
> netlink)
> 
> messages: # or maybe 'attribute-sets' ?
>   - name: channels
>     ...

Still not convinced about messages, as it makes me think that every
"space" is then a definition of a message rather than just container
for field definitions with independent ID spaces. 

Attribute-sets sounds good, happy to rename.

Another thought I just had was to call it something like "data-types"
or "field-types" or "type-spaces". To indicate the split into "data" 
and "actions"/"operations"?

> operations:
>   - name: channel_get
>     message: channels
>     do:
>       request:
>         fields:
>         - header
>         - rx_max
> 
> Or maybe all we really need is a section in the doc called 'Netlink
> for gRPC/Thrift users' where we map these concepts:
> - attribute-spaces (attribute-sets?) -> messages
> - attributes -> fields

Excellent idea!

> > Dunno, that'd mean that the Python method is called
> > ETHTOOL_MSG_CHANNELS_GET rather than just channels_get.
> > I don't want to force all languages to use the C naming.
> > The C naming just leads to silly copy'n'paste issues like
> > f329a0ebeab.  
> 
> Can we have 'name:' and 'long-name:' or 'c-name:' or 'full-name' ?
> 
> - name: header
>    attributes:
>     - name: dev_index
>       full-name: ETHTOOL_A_HEADER_DEV_INDEX
>       val:
>       type:
> 
> Suppose I'm rewriting my c application from uapi to some generated (in
> the future) python-like channels_get() method. If I can grep for
> ETHTOOL_MSG_CHANNELS_GET, that would save me a bunch of time figuring
> out what the new canonical wrapper is.
> 
> Also, maybe, at some point we'll have:
> - name: dev_index
>   c-name: ETHTOOL_A_HEADER_DEV_INDEX
>   java-name: headerDevIndex

Herm, looking at my commits where I started going with the C codegen
(which I haven't posted here) is converting the values to the same
format as keys (i.e. YAML/JSON style with dashes). So the codegen does:

	c_name = attr['name']
	if c_name in c_keywords:
		c_name += '_'
	c_name = c_name.replace('-', '_')

So the name would be "dev-index", C will make that dev_index, Java will
make that devIndex (or whatever) etc.

I really don't want people to have to prefix the names because that's
creating more work. We can slap a /* header.dev_index */ comment in 
the generated uAPI, for the grep? Dunno..

> > Good catch, I'm aware. I was planning to add a "header constants"
> > section or some such. A section in "headers" which defines the
> > constants which C code will get from the headers.  
> 
> Define as in 're-define' or define as in 'you need to include some
> other header for this to work'?
> 
> const:
>   - name: ALTIFNAMSIZ
>     val: 128

This one. In most cases the constant is defined in the same uAPI header
as the proto so we're good. But there's IFNAMSIZ and friends which are
shared.

> which then does
> 
> #ifndef
> #define ALTIFNAMSIZ 128
> #else
> static_assert(ALTIFNAMSIZ == 128)
> #endif
> 
> ?
> 
> or:
> 
> external-const:
>   - name: ALTIFNAMSIZ
>     header: include/uapi/linux/if.h
> 
> which then might generate the following:
> 
> #include <include/uapi/linux/if.h>
> #ifndef ALTIFNAMSIZ
> #error "include/uapi/linux/if.h does not define ALTIFNAMSIZ"
> #endif
> 
> > For Python it does not matter, as we don't have to size arrays.  
> 
> Hm, I was expecting the situation to be the opposite :-) Because if
> you really have to know this len in python, how do you resolve
> ALTIFNAMSIZ?

Why does Python need to know the length of the string tho?
On receive if kernel gives you a longer name - great, no problem.
On send the kernel will tell you so also meh.

In C the struct has a char bla[FIXED_SIZE] so if we get an oversized
string we're pooped, that's my point, dunno what other practical use
the string sizing has.

> The simplest thing to do might be to require these headers to be
> hermetic (i.e., redefine all consts the spec cares about internally)?

That's what I'm thinking if they are actually needed. But it only C
cares we can just slap the right includes and not worry. Dunno if other
languages are similarly string-challenged. 

^ permalink raw reply

* [PATCH bpf-next v2 3/3] selftests/bpf: tests for using dynptrs to parse skb and xdp buffers
From: Joanne Koong @ 2022-08-11 23:05 UTC (permalink / raw)
  To: bpf; +Cc: kafai, kuba, andrii, daniel, ast, netdev, kernel-team,
	Joanne Koong
In-Reply-To: <20220811230501.2632393-1-joannelkoong@gmail.com>

Test skb and xdp dynptr functionality in the following ways:

1) progs/test_cls_redirect_dynptr.c
   * Rewrite "progs/test_cls_redirect.c" test to use dynptrs to parse
     skb data

   * This is a great example of how dynptrs can be used to simplify a
     lot of the parsing logic for non-statically known values, and speed
     up execution times.

     When measuring the user + system time between the original version
     vs. using dynptrs, and averaging the time for 10 runs (using
     "time ./test_progs -t cls_redirect"), there was a 2x speed-up:
         original version: 0.053 sec
         with dynptrs: 0.025 sec

2) progs/test_xdp_dynptr.c
   * Rewrite "progs/test_xdp.c" test to use dynptrs to parse xdp data

     There were no noticeable diferences in user + system time between
     the original version vs. using dynptrs. Averaging the time for 10
     runs (run using "time ./test_progs -t xdp_bpf2bpf"):
         original version: 0.0449 sec
         with dynptrs: 0.0429 sec

3) progs/test_l4lb_noinline_dynptr.c
   * Rewrite "progs/test_l4lb_noinline.c" test to use dynptrs to parse
     skb data

     There were no noticeable diferences in user + system time between
     the original version vs. using dynptrs. Averaging the time for 10
     runs (run using "time ./test_progs -t l4lb_all"):
         original version: 0.0502 sec
         with dynptrs: 0.055 sec

     For number of processed verifier instructions:
         original version: 6284 insns
         with dynptrs: 2538 insns

4) progs/test_parse_tcp_hdr_opt_dynptr.c
   * Add sample code for parsing tcp hdr opt lookup using dynptrs.
     This logic is lifted from a real-world use case of packet parsing in
     katran [0], a layer 4 load balancer. The original version
     "progs/test_parse_tcp_hdr_opt.c" (not using dynptrs) is included
     here as well, for comparison.

5) progs/dynptr_success.c
   * Add test case "test_skb_readonly" for testing attempts at writes /
     data slices on a prog type with read-only skb ctx.

6) progs/dynptr_fail.c
   * Add test cases "skb_invalid_data_slice{1,2}" and
     "xdp_invalid_data_slice" for testing that helpers that modify the
     underlying packet buffer automatically invalidate the associated
     data slice.
   * Add test cases "skb_invalid_ctx" and "xdp_invalid_ctx" for testing
     that prog types that do not support bpf_dynptr_from_skb/xdp don't
     have access to the API.
   * Add test case "skb_invalid_write" for testing that read-only skb
     dynptrs can't be written to through data slices.

[0] https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 .../selftests/bpf/prog_tests/cls_redirect.c   |  25 +
 .../testing/selftests/bpf/prog_tests/dynptr.c | 132 ++-
 .../selftests/bpf/prog_tests/l4lb_all.c       |   2 +
 .../bpf/prog_tests/parse_tcp_hdr_opt.c        |  85 ++
 .../selftests/bpf/prog_tests/xdp_attach.c     |   9 +-
 .../testing/selftests/bpf/progs/dynptr_fail.c | 111 ++
 .../selftests/bpf/progs/dynptr_success.c      |  29 +
 .../bpf/progs/test_cls_redirect_dynptr.c      | 968 ++++++++++++++++++
 .../bpf/progs/test_l4lb_noinline_dynptr.c     | 468 +++++++++
 .../bpf/progs/test_parse_tcp_hdr_opt.c        | 119 +++
 .../bpf/progs/test_parse_tcp_hdr_opt_dynptr.c | 110 ++
 .../selftests/bpf/progs/test_xdp_dynptr.c     | 240 +++++
 .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
 13 files changed, 2255 insertions(+), 44 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_dynptr.c

diff --git a/tools/testing/selftests/bpf/prog_tests/cls_redirect.c b/tools/testing/selftests/bpf/prog_tests/cls_redirect.c
index 224f016b0a53..2a55f717fc07 100644
--- a/tools/testing/selftests/bpf/prog_tests/cls_redirect.c
+++ b/tools/testing/selftests/bpf/prog_tests/cls_redirect.c
@@ -13,6 +13,7 @@
 
 #include "progs/test_cls_redirect.h"
 #include "test_cls_redirect.skel.h"
+#include "test_cls_redirect_dynptr.skel.h"
 #include "test_cls_redirect_subprogs.skel.h"
 
 #define ENCAP_IP INADDR_LOOPBACK
@@ -446,6 +447,28 @@ static void test_cls_redirect_common(struct bpf_program *prog)
 	close_fds((int *)conns, sizeof(conns) / sizeof(conns[0][0]));
 }
 
+static void test_cls_redirect_dynptr(void)
+{
+	struct test_cls_redirect_dynptr *skel;
+	int err;
+
+	skel = test_cls_redirect_dynptr__open();
+	if (!ASSERT_OK_PTR(skel, "skel_open"))
+		return;
+
+	skel->rodata->ENCAPSULATION_IP = htonl(ENCAP_IP);
+	skel->rodata->ENCAPSULATION_PORT = htons(ENCAP_PORT);
+
+	err = test_cls_redirect_dynptr__load(skel);
+	if (!ASSERT_OK(err, "skel_load"))
+		goto cleanup;
+
+	test_cls_redirect_common(skel->progs.cls_redirect);
+
+cleanup:
+	test_cls_redirect_dynptr__destroy(skel);
+}
+
 static void test_cls_redirect_inlined(void)
 {
 	struct test_cls_redirect *skel;
@@ -496,4 +519,6 @@ void test_cls_redirect(void)
 		test_cls_redirect_inlined();
 	if (test__start_subtest("cls_redirect_subprogs"))
 		test_cls_redirect_subprogs();
+	if (test__start_subtest("cls_redirect_dynptr"))
+		test_cls_redirect_dynptr();
 }
diff --git a/tools/testing/selftests/bpf/prog_tests/dynptr.c b/tools/testing/selftests/bpf/prog_tests/dynptr.c
index bcf80b9f7c27..3ec1a8b6b6fb 100644
--- a/tools/testing/selftests/bpf/prog_tests/dynptr.c
+++ b/tools/testing/selftests/bpf/prog_tests/dynptr.c
@@ -2,51 +2,69 @@
 /* Copyright (c) 2022 Facebook */
 
 #include <test_progs.h>
+#include <network_helpers.h>
 #include "dynptr_fail.skel.h"
 #include "dynptr_success.skel.h"
 
 static size_t log_buf_sz = 1048576; /* 1 MB */
 static char obj_log_buf[1048576];
 
+enum test_setup_type {
+	/* no set up is required. the prog will just be loaded */
+	SETUP_NONE,
+	SETUP_SYSCALL_SLEEP,
+	SETUP_SKB_PROG,
+};
+
 static struct {
 	const char *prog_name;
 	const char *expected_err_msg;
+	enum test_setup_type type;
 } dynptr_tests[] = {
-	/* failure cases */
-	{"ringbuf_missing_release1", "Unreleased reference id=1"},
-	{"ringbuf_missing_release2", "Unreleased reference id=2"},
-	{"ringbuf_missing_release_callback", "Unreleased reference id"},
-	{"use_after_invalid", "Expected an initialized dynptr as arg #3"},
-	{"ringbuf_invalid_api", "type=mem expected=alloc_mem"},
-	{"add_dynptr_to_map1", "invalid indirect read from stack"},
-	{"add_dynptr_to_map2", "invalid indirect read from stack"},
-	{"data_slice_out_of_bounds_ringbuf", "value is outside of the allowed memory range"},
-	{"data_slice_out_of_bounds_map_value", "value is outside of the allowed memory range"},
-	{"data_slice_use_after_release1", "invalid mem access 'scalar'"},
-	{"data_slice_use_after_release2", "invalid mem access 'scalar'"},
-	{"data_slice_missing_null_check1", "invalid mem access 'mem_or_null'"},
-	{"data_slice_missing_null_check2", "invalid mem access 'mem_or_null'"},
-	{"invalid_helper1", "invalid indirect read from stack"},
-	{"invalid_helper2", "Expected an initialized dynptr as arg #3"},
-	{"invalid_write1", "Expected an initialized dynptr as arg #1"},
-	{"invalid_write2", "Expected an initialized dynptr as arg #3"},
-	{"invalid_write3", "Expected an initialized ringbuf dynptr as arg #1"},
-	{"invalid_write4", "arg 1 is an unacquired reference"},
-	{"invalid_read1", "invalid read from stack"},
-	{"invalid_read2", "cannot pass in dynptr at an offset"},
-	{"invalid_read3", "invalid read from stack"},
-	{"invalid_read4", "invalid read from stack"},
-	{"invalid_offset", "invalid write to stack"},
-	{"global", "type=map_value expected=fp"},
-	{"release_twice", "arg 1 is an unacquired reference"},
-	{"release_twice_callback", "arg 1 is an unacquired reference"},
+	/* these cases should trigger a verifier error */
+	{"ringbuf_missing_release1", "Unreleased reference id=1", SETUP_NONE},
+	{"ringbuf_missing_release2", "Unreleased reference id=2", SETUP_NONE},
+	{"ringbuf_missing_release_callback", "Unreleased reference id", SETUP_NONE},
+	{"use_after_invalid", "Expected an initialized dynptr as arg #3", SETUP_NONE},
+	{"ringbuf_invalid_api", "type=mem expected=alloc_mem", SETUP_NONE},
+	{"add_dynptr_to_map1", "invalid indirect read from stack", SETUP_NONE},
+	{"add_dynptr_to_map2", "invalid indirect read from stack", SETUP_NONE},
+	{"data_slice_out_of_bounds_ringbuf", "value is outside of the allowed memory range",
+		SETUP_NONE},
+	{"data_slice_out_of_bounds_map_value", "value is outside of the allowed memory range",
+		SETUP_NONE},
+	{"data_slice_use_after_release1", "invalid mem access 'scalar'", SETUP_NONE},
+	{"data_slice_use_after_release2", "invalid mem access 'scalar'", SETUP_NONE},
+	{"data_slice_missing_null_check1", "invalid mem access 'mem_or_null'", SETUP_NONE},
+	{"data_slice_missing_null_check2", "invalid mem access 'mem_or_null'", SETUP_NONE},
+	{"invalid_helper1", "invalid indirect read from stack", SETUP_NONE},
+	{"invalid_helper2", "Expected an initialized dynptr as arg #3", SETUP_NONE},
+	{"invalid_write1", "Expected an initialized dynptr as arg #1", SETUP_NONE},
+	{"invalid_write2", "Expected an initialized dynptr as arg #3", SETUP_NONE},
+	{"invalid_write3", "Expected an initialized ringbuf dynptr as arg #1", SETUP_NONE},
+	{"invalid_write4", "arg 1 is an unacquired reference", SETUP_NONE},
+	{"invalid_read1", "invalid read from stack", SETUP_NONE},
+	{"invalid_read2", "cannot pass in dynptr at an offset", SETUP_NONE},
+	{"invalid_read3", "invalid read from stack", SETUP_NONE},
+	{"invalid_read4", "invalid read from stack", SETUP_NONE},
+	{"invalid_offset", "invalid write to stack", SETUP_NONE},
+	{"global", "type=map_value expected=fp", SETUP_NONE},
+	{"release_twice", "arg 1 is an unacquired reference", SETUP_NONE},
+	{"release_twice_callback", "arg 1 is an unacquired reference", SETUP_NONE},
 	{"dynptr_from_mem_invalid_api",
-		"Unsupported reg type fp for bpf_dynptr_from_mem data"},
-
-	/* success cases */
-	{"test_read_write", NULL},
-	{"test_data_slice", NULL},
-	{"test_ringbuf", NULL},
+		"Unsupported reg type fp for bpf_dynptr_from_mem data", SETUP_NONE},
+	{"skb_invalid_data_slice1", "invalid mem access 'scalar'", SETUP_NONE},
+	{"skb_invalid_data_slice2", "invalid mem access 'scalar'", SETUP_NONE},
+	{"xdp_invalid_data_slice", "invalid mem access 'scalar'", SETUP_NONE},
+	{"skb_invalid_ctx", "unknown func bpf_dynptr_from_skb", SETUP_NONE},
+	{"xdp_invalid_ctx", "unknown func bpf_dynptr_from_xdp", SETUP_NONE},
+	{"skb_invalid_write", "cannot write into packet", SETUP_NONE},
+
+	/* these tests should be run and should succeed */
+	{"test_read_write", NULL, SETUP_SYSCALL_SLEEP},
+	{"test_data_slice", NULL, SETUP_SYSCALL_SLEEP},
+	{"test_ringbuf", NULL, SETUP_SYSCALL_SLEEP},
+	{"test_skb_readonly", NULL, SETUP_SKB_PROG},
 };
 
 static void verify_fail(const char *prog_name, const char *expected_err_msg)
@@ -85,7 +103,7 @@ static void verify_fail(const char *prog_name, const char *expected_err_msg)
 	dynptr_fail__destroy(skel);
 }
 
-static void verify_success(const char *prog_name)
+static void run_test(const char *prog_name, enum test_setup_type setup_type)
 {
 	struct dynptr_success *skel;
 	struct bpf_program *prog;
@@ -107,15 +125,45 @@ static void verify_success(const char *prog_name)
 	if (!ASSERT_OK_PTR(prog, "bpf_object__find_program_by_name"))
 		goto cleanup;
 
-	link = bpf_program__attach(prog);
-	if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
-		goto cleanup;
+	switch (setup_type) {
+	case SETUP_SYSCALL_SLEEP:
+		link = bpf_program__attach(prog);
+		if (!ASSERT_OK_PTR(link, "bpf_program__attach"))
+			goto cleanup;
 
-	usleep(1);
+		usleep(1);
 
-	ASSERT_EQ(skel->bss->err, 0, "err");
+		bpf_link__destroy(link);
+		break;
+	case SETUP_SKB_PROG:
+	{
+		int prog_fd, err;
+		char buf[64];
+
+		LIBBPF_OPTS(bpf_test_run_opts, topts,
+			    .data_in = &pkt_v4,
+			    .data_size_in = sizeof(pkt_v4),
+			    .data_out = buf,
+			    .data_size_out = sizeof(buf),
+			    .repeat = 1,
+		);
+
+		prog_fd = bpf_program__fd(prog);
+		if (!ASSERT_GE(prog_fd, 0, "prog_fd"))
+			goto cleanup;
 
-	bpf_link__destroy(link);
+		err = bpf_prog_test_run_opts(prog_fd, &topts);
+
+		if (!ASSERT_OK(err, "test_run"))
+			goto cleanup;
+
+		break;
+	}
+	case SETUP_NONE:
+		ASSERT_EQ(0, 1, "internal error: SETUP_NONE unimplemented");
+	}
+
+	ASSERT_EQ(skel->bss->err, 0, "err");
 
 cleanup:
 	dynptr_success__destroy(skel);
@@ -133,6 +181,6 @@ void test_dynptr(void)
 			verify_fail(dynptr_tests[i].prog_name,
 				    dynptr_tests[i].expected_err_msg);
 		else
-			verify_success(dynptr_tests[i].prog_name);
+			run_test(dynptr_tests[i].prog_name, dynptr_tests[i].type);
 	}
 }
diff --git a/tools/testing/selftests/bpf/prog_tests/l4lb_all.c b/tools/testing/selftests/bpf/prog_tests/l4lb_all.c
index 55f733ff4109..94079c89f2e9 100644
--- a/tools/testing/selftests/bpf/prog_tests/l4lb_all.c
+++ b/tools/testing/selftests/bpf/prog_tests/l4lb_all.c
@@ -93,4 +93,6 @@ void test_l4lb_all(void)
 		test_l4lb("test_l4lb.o");
 	if (test__start_subtest("l4lb_noinline"))
 		test_l4lb("test_l4lb_noinline.o");
+	if (test__start_subtest("l4lb_noinline_dynptr"))
+		test_l4lb("test_l4lb_noinline_dynptr.o");
 }
diff --git a/tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c b/tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c
new file mode 100644
index 000000000000..0fe729ccedca
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "test_parse_tcp_hdr_opt.skel.h"
+#include "test_parse_tcp_hdr_opt_dynptr.skel.h"
+#include "test_tcp_hdr_options.h"
+
+struct test_pkt {
+	struct ipv6_packet pk6_v6;
+	u8 options[16];
+} __packed;
+
+struct test_pkt pkt = {
+	.pk6_v6.eth.h_proto = __bpf_constant_htons(ETH_P_IPV6),
+	.pk6_v6.iph.nexthdr = IPPROTO_TCP,
+	.pk6_v6.iph.payload_len = __bpf_constant_htons(MAGIC_BYTES),
+	.pk6_v6.tcp.urg_ptr = 123,
+	.pk6_v6.tcp.doff = 9, /* 16 bytes of options */
+
+	.options = {
+		TCPOPT_MSS, 4, 0x05, 0xB4, TCPOPT_NOP, TCPOPT_NOP,
+		0, 6, 0, 0, 0, 9, TCPOPT_EOL
+	},
+};
+
+static void test_parsing(bool use_dynptr)
+{
+	char buf[128];
+	struct bpf_program *prog;
+	void *skel_ptr;
+	int err;
+
+	LIBBPF_OPTS(bpf_test_run_opts, topts,
+		    .data_in = &pkt,
+		    .data_size_in = sizeof(pkt),
+		    .data_out = buf,
+		    .data_size_out = sizeof(buf),
+		    .repeat = 3,
+	);
+
+	if (use_dynptr) {
+		struct test_parse_tcp_hdr_opt_dynptr *skel;
+
+		skel = test_parse_tcp_hdr_opt_dynptr__open_and_load();
+		if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+			return;
+
+		pkt.options[6] = skel->rodata->tcp_hdr_opt_kind_tpr;
+		prog = skel->progs.xdp_ingress_v6;
+		skel_ptr = skel;
+	} else {
+		struct test_parse_tcp_hdr_opt *skel;
+
+		skel = test_parse_tcp_hdr_opt__open_and_load();
+		if (!ASSERT_OK_PTR(skel, "skel_open_and_load"))
+			return;
+
+		pkt.options[6] = skel->rodata->tcp_hdr_opt_kind_tpr;
+		prog = skel->progs.xdp_ingress_v6;
+		skel_ptr = skel;
+	}
+
+	err = bpf_prog_test_run_opts(bpf_program__fd(prog), &topts);
+	ASSERT_OK(err, "ipv6 test_run");
+	ASSERT_EQ(topts.retval, XDP_PASS, "ipv6 test_run retval");
+
+	if (use_dynptr) {
+		struct test_parse_tcp_hdr_opt_dynptr *skel = skel_ptr;
+
+		ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id");
+		test_parse_tcp_hdr_opt_dynptr__destroy(skel);
+	} else {
+		struct test_parse_tcp_hdr_opt *skel = skel_ptr;
+
+		ASSERT_EQ(skel->bss->server_id, 0x9000000, "server id");
+		test_parse_tcp_hdr_opt__destroy(skel);
+	}
+}
+
+void test_parse_tcp_hdr_opt(void)
+{
+	test_parsing(false);
+	test_parsing(true);
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_attach.c b/tools/testing/selftests/bpf/prog_tests/xdp_attach.c
index 62aa3edda5e6..40d0d61af9e6 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_attach.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_attach.c
@@ -4,11 +4,10 @@
 #define IFINDEX_LO 1
 #define XDP_FLAGS_REPLACE		(1U << 4)
 
-void serial_test_xdp_attach(void)
+static void serial_test_xdp_attach(const char *file)
 {
 	__u32 duration = 0, id1, id2, id0 = 0, len;
 	struct bpf_object *obj1, *obj2, *obj3;
-	const char *file = "./test_xdp.o";
 	struct bpf_prog_info info = {};
 	int err, fd1, fd2, fd3;
 	LIBBPF_OPTS(bpf_xdp_attach_opts, opts);
@@ -85,3 +84,9 @@ void serial_test_xdp_attach(void)
 out_1:
 	bpf_object__close(obj1);
 }
+
+void test_xdp_attach(void)
+{
+	serial_test_xdp_attach("./test_xdp.o");
+	serial_test_xdp_attach("./test_xdp_dynptr.o");
+}
diff --git a/tools/testing/selftests/bpf/progs/dynptr_fail.c b/tools/testing/selftests/bpf/progs/dynptr_fail.c
index b0f08ff024fb..141765b2fcb5 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_fail.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_fail.c
@@ -5,6 +5,7 @@
 #include <string.h>
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>
+#include <linux/if_ether.h>
 #include "bpf_misc.h"
 
 char _license[] SEC("license") = "GPL";
@@ -622,3 +623,113 @@ int dynptr_from_mem_invalid_api(void *ctx)
 
 	return 0;
 }
+
+/* The data slice is invalidated whenever a helper changes packet data */
+SEC("?tc")
+int skb_invalid_data_slice1(struct __sk_buff *skb)
+{
+	struct bpf_dynptr ptr;
+	struct ethhdr *hdr;
+
+	bpf_dynptr_from_skb(skb, 0, &ptr);
+	hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr));
+
+	if (bpf_skb_pull_data(skb, skb->len))
+		return SK_DROP;
+
+	if (!hdr)
+		return SK_DROP;
+
+	/* this should fail */
+	hdr->h_proto = 1;
+
+	return SK_PASS;
+}
+
+/* The data slice is invalidated whenever bpf_dynptr_write() is called */
+SEC("?tc")
+int skb_invalid_data_slice2(struct __sk_buff *skb)
+{
+	char write_data[64] = "hello there, world!!";
+	struct bpf_dynptr ptr;
+	struct ethhdr *hdr;
+
+	bpf_dynptr_from_skb(skb, 0, &ptr);
+	hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr));
+
+	bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0);
+
+	if (!hdr)
+		return SK_DROP;
+
+	/* this should fail */
+	hdr->h_proto = 1;
+
+	return SK_PASS;
+}
+
+/* The data slice is invalidated whenever a helper changes packet data */
+SEC("?xdp")
+int xdp_invalid_data_slice(struct xdp_md *xdp)
+{
+	struct bpf_dynptr ptr;
+	struct ethhdr *hdr;
+
+	bpf_dynptr_from_xdp(xdp, 0, &ptr);
+	hdr = bpf_dynptr_data(&ptr, 0, sizeof(*hdr));
+	if (!hdr)
+		return SK_DROP;
+
+	hdr->h_proto = 9;
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)sizeof(*hdr)))
+		return XDP_DROP;
+
+	/* this should fail */
+	hdr->h_proto = 1;
+
+	return XDP_PASS;
+}
+
+/* Only supported prog type can create skb-type dynptrs */
+SEC("?raw_tp/sys_nanosleep")
+int skb_invalid_ctx(void *ctx)
+{
+	struct bpf_dynptr ptr;
+
+	/* this should fail */
+	bpf_dynptr_from_skb(ctx, 0, &ptr);
+
+	return 0;
+}
+
+/* Only supported prog type can create xdp-type dynptrs */
+SEC("?raw_tp/sys_nanosleep")
+int xdp_invalid_ctx(void *ctx)
+{
+	struct bpf_dynptr ptr;
+
+	/* this should fail */
+	bpf_dynptr_from_xdp(ctx, 0, &ptr);
+
+	return 0;
+}
+
+/* Read-only skb packet buffers can't be written to through data slices */
+SEC("?cgroup_skb/egress")
+int skb_invalid_write(struct __sk_buff *skb)
+{
+	struct bpf_dynptr ptr;
+	__u64 *data;
+
+	bpf_dynptr_from_skb(skb, 0, &ptr);
+
+	data = bpf_dynptr_data(&ptr, 0, sizeof(*data));
+	if (!data)
+		return 0;
+
+	/* this should fail */
+	*data = 123;
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/dynptr_success.c b/tools/testing/selftests/bpf/progs/dynptr_success.c
index a3a6103c8569..d0e504d7aae7 100644
--- a/tools/testing/selftests/bpf/progs/dynptr_success.c
+++ b/tools/testing/selftests/bpf/progs/dynptr_success.c
@@ -162,3 +162,32 @@ int test_ringbuf(void *ctx)
 	bpf_ringbuf_discard_dynptr(&ptr, 0);
 	return 0;
 }
+
+SEC("cgroup_skb/egress")
+int test_skb_readonly(struct __sk_buff *skb)
+{
+	__u8 write_data[2] = {1, 2};
+	struct bpf_dynptr ptr;
+	__u64 *data;
+	int ret;
+
+	if (bpf_dynptr_from_skb(skb, 0, &ptr)) {
+		err = 1;
+		return 0;
+	}
+
+	data = bpf_dynptr_data(&ptr, 0, sizeof(*data));
+	if (!data) {
+		err = 2;
+		return 0;
+	}
+
+	/* since cgroup skbs are read only, writes should fail */
+	ret = bpf_dynptr_write(&ptr, 0, write_data, sizeof(write_data), 0);
+	if (ret != -EINVAL) {
+		err = 3;
+		return 0;
+	}
+
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c b/tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c
new file mode 100644
index 000000000000..9549ff7f16b9
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c
@@ -0,0 +1,968 @@
+// SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+// Copyright (c) 2019, 2020 Cloudflare
+
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#include <linux/bpf.h>
+#include <linux/icmp.h>
+#include <linux/icmpv6.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/pkt_cls.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#include "test_cls_redirect.h"
+
+#define offsetofend(TYPE, MEMBER) \
+	(offsetof(TYPE, MEMBER) + sizeof((((TYPE *)0)->MEMBER)))
+
+#define IP_OFFSET_MASK (0x1FFF)
+#define IP_MF (0x2000)
+
+char _license[] SEC("license") = "Dual BSD/GPL";
+
+/**
+ * Destination port and IP used for UDP encapsulation.
+ */
+volatile const __be16 ENCAPSULATION_PORT;
+volatile const __be32 ENCAPSULATION_IP;
+
+typedef struct {
+	uint64_t processed_packets_total;
+	uint64_t l3_protocol_packets_total_ipv4;
+	uint64_t l3_protocol_packets_total_ipv6;
+	uint64_t l4_protocol_packets_total_tcp;
+	uint64_t l4_protocol_packets_total_udp;
+	uint64_t accepted_packets_total_syn;
+	uint64_t accepted_packets_total_syn_cookies;
+	uint64_t accepted_packets_total_last_hop;
+	uint64_t accepted_packets_total_icmp_echo_request;
+	uint64_t accepted_packets_total_established;
+	uint64_t forwarded_packets_total_gue;
+	uint64_t forwarded_packets_total_gre;
+
+	uint64_t errors_total_unknown_l3_proto;
+	uint64_t errors_total_unknown_l4_proto;
+	uint64_t errors_total_malformed_ip;
+	uint64_t errors_total_fragmented_ip;
+	uint64_t errors_total_malformed_icmp;
+	uint64_t errors_total_unwanted_icmp;
+	uint64_t errors_total_malformed_icmp_pkt_too_big;
+	uint64_t errors_total_malformed_tcp;
+	uint64_t errors_total_malformed_udp;
+	uint64_t errors_total_icmp_echo_replies;
+	uint64_t errors_total_malformed_encapsulation;
+	uint64_t errors_total_encap_adjust_failed;
+	uint64_t errors_total_encap_buffer_too_small;
+	uint64_t errors_total_redirect_loop;
+	uint64_t errors_total_encap_mtu_violate;
+} metrics_t;
+
+typedef enum {
+	INVALID = 0,
+	UNKNOWN,
+	ECHO_REQUEST,
+	SYN,
+	SYN_COOKIE,
+	ESTABLISHED,
+} verdict_t;
+
+typedef struct {
+	uint16_t src, dst;
+} flow_ports_t;
+
+_Static_assert(
+	sizeof(flow_ports_t) !=
+		offsetofend(struct bpf_sock_tuple, ipv4.dport) -
+			offsetof(struct bpf_sock_tuple, ipv4.sport) - 1,
+	"flow_ports_t must match sport and dport in struct bpf_sock_tuple");
+_Static_assert(
+	sizeof(flow_ports_t) !=
+		offsetofend(struct bpf_sock_tuple, ipv6.dport) -
+			offsetof(struct bpf_sock_tuple, ipv6.sport) - 1,
+	"flow_ports_t must match sport and dport in struct bpf_sock_tuple");
+
+struct iphdr_info {
+	void *hdr;
+	__u64 len;
+};
+
+typedef int ret_t;
+
+/* This is a bit of a hack. We need a return value which allows us to
+ * indicate that the regular flow of the program should continue,
+ * while allowing functions to use XDP_PASS and XDP_DROP, etc.
+ */
+static const ret_t CONTINUE_PROCESSING = -1;
+
+/* Convenience macro to call functions which return ret_t.
+ */
+#define MAYBE_RETURN(x)                           \
+	do {                                      \
+		ret_t __ret = x;                  \
+		if (__ret != CONTINUE_PROCESSING) \
+			return __ret;             \
+	} while (0)
+
+static bool ipv4_is_fragment(const struct iphdr *ip)
+{
+	uint16_t frag_off = ip->frag_off & bpf_htons(IP_OFFSET_MASK);
+	return (ip->frag_off & bpf_htons(IP_MF)) != 0 || frag_off > 0;
+}
+
+static int pkt_parse_ipv4(struct bpf_dynptr *dynptr, __u64 *offset, struct iphdr *iphdr)
+{
+	if (bpf_dynptr_read(iphdr, sizeof(*iphdr), dynptr, *offset, 0))
+		return -1;
+
+	*offset += sizeof(*iphdr);
+
+	if (iphdr->ihl < 5)
+		return -1;
+
+	/* skip ipv4 options */
+	*offset += (iphdr->ihl - 5) * 4;
+
+	return 0;
+}
+
+/* Parse the L4 ports from a packet, assuming a layout like TCP or UDP. */
+static bool pkt_parse_icmp_l4_ports(struct bpf_dynptr *dynptr, __u64 *offset, flow_ports_t *ports)
+{
+	if (bpf_dynptr_read(ports, sizeof(*ports), dynptr, *offset, 0))
+		return false;
+
+	*offset += sizeof(*ports);
+
+	/* Ports in the L4 headers are reversed, since we are parsing an ICMP
+	 * payload which is going towards the eyeball.
+	 */
+	uint16_t dst = ports->src;
+	ports->src = ports->dst;
+	ports->dst = dst;
+	return true;
+}
+
+static uint16_t pkt_checksum_fold(uint32_t csum)
+{
+	/* The highest reasonable value for an IPv4 header
+	 * checksum requires two folds, so we just do that always.
+	 */
+	csum = (csum & 0xffff) + (csum >> 16);
+	csum = (csum & 0xffff) + (csum >> 16);
+	return (uint16_t)~csum;
+}
+
+static void pkt_ipv4_checksum(struct iphdr *iph)
+{
+	iph->check = 0;
+
+	/* An IP header without options is 20 bytes. Two of those
+	 * are the checksum, which we always set to zero. Hence,
+	 * the maximum accumulated value is 18 / 2 * 0xffff = 0x8fff7,
+	 * which fits in 32 bit.
+	 */
+	_Static_assert(sizeof(struct iphdr) == 20, "iphdr must be 20 bytes");
+	uint32_t acc = 0;
+	uint16_t *ipw = (uint16_t *)iph;
+
+	for (size_t i = 0; i < sizeof(struct iphdr) / 2; i++)
+		acc += ipw[i];
+
+	iph->check = pkt_checksum_fold(acc);
+}
+
+static bool pkt_skip_ipv6_extension_headers(struct bpf_dynptr *dynptr, __u64 *offset,
+					    const struct ipv6hdr *ipv6, uint8_t *upper_proto,
+					    bool *is_fragment)
+{
+	/* We understand five extension headers.
+	 * https://tools.ietf.org/html/rfc8200#section-4.1 states that all
+	 * headers should occur once, except Destination Options, which may
+	 * occur twice. Hence we give up after 6 headers.
+	 */
+	struct {
+		uint8_t next;
+		uint8_t len;
+	} exthdr = {
+		.next = ipv6->nexthdr,
+	};
+	*is_fragment = false;
+
+	for (int i = 0; i < 6; i++) {
+		switch (exthdr.next) {
+		case IPPROTO_FRAGMENT:
+			*is_fragment = true;
+			/* NB: We don't check that hdrlen == 0 as per spec. */
+			/* fallthrough; */
+
+		case IPPROTO_HOPOPTS:
+		case IPPROTO_ROUTING:
+		case IPPROTO_DSTOPTS:
+		case IPPROTO_MH:
+			if (bpf_dynptr_read(&exthdr, sizeof(exthdr), dynptr, *offset, 0))
+				return false;
+
+			/* hdrlen is in 8-octet units, and excludes the first 8 octets. */
+			*offset += (exthdr.len + 1) * 8;
+
+			/* Decode next header */
+			break;
+
+		default:
+			/* The next header is not one of the known extension
+			 * headers, treat it as the upper layer header.
+			 *
+			 * This handles IPPROTO_NONE.
+			 *
+			 * Encapsulating Security Payload (50) and Authentication
+			 * Header (51) also end up here (and will trigger an
+			 * unknown proto error later). They have a custom header
+			 * format and seem too esoteric to care about.
+			 */
+			*upper_proto = exthdr.next;
+			return true;
+		}
+	}
+
+	/* We never found an upper layer header. */
+	return false;
+}
+
+static int pkt_parse_ipv6(struct bpf_dynptr *dynptr, __u64 *offset, struct ipv6hdr *ipv6,
+			  uint8_t *proto, bool *is_fragment)
+{
+	if (bpf_dynptr_read(ipv6, sizeof(*ipv6), dynptr, *offset, 0))
+		return -1;
+
+	*offset += sizeof(*ipv6);
+
+	if (!pkt_skip_ipv6_extension_headers(dynptr, offset, ipv6, proto, is_fragment))
+		return -1;
+
+	return 0;
+}
+
+/* Global metrics, per CPU
+ */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, unsigned int);
+	__type(value, metrics_t);
+} metrics_map SEC(".maps");
+
+static metrics_t *get_global_metrics(void)
+{
+	uint64_t key = 0;
+	return bpf_map_lookup_elem(&metrics_map, &key);
+}
+
+static ret_t accept_locally(struct __sk_buff *skb, encap_headers_t *encap)
+{
+	const int payload_off =
+		sizeof(*encap) +
+		sizeof(struct in_addr) * encap->unigue.hop_count;
+	int32_t encap_overhead = payload_off - sizeof(struct ethhdr);
+
+	// Changing the ethertype if the encapsulated packet is ipv6
+	if (encap->gue.proto_ctype == IPPROTO_IPV6)
+		encap->eth.h_proto = bpf_htons(ETH_P_IPV6);
+
+	if (bpf_skb_adjust_room(skb, -encap_overhead, BPF_ADJ_ROOM_MAC,
+				BPF_F_ADJ_ROOM_FIXED_GSO |
+				BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
+	    bpf_csum_level(skb, BPF_CSUM_LEVEL_DEC))
+		return TC_ACT_SHOT;
+
+	return bpf_redirect(skb->ifindex, BPF_F_INGRESS);
+}
+
+static ret_t forward_with_gre(struct __sk_buff *skb, struct bpf_dynptr *dynptr,
+			      encap_headers_t *encap, struct in_addr *next_hop,
+			      metrics_t *metrics)
+{
+	const int payload_off =
+		sizeof(*encap) +
+		sizeof(struct in_addr) * encap->unigue.hop_count;
+	int32_t encap_overhead =
+		payload_off - sizeof(struct ethhdr) - sizeof(struct iphdr);
+	int32_t delta = sizeof(struct gre_base_hdr) - encap_overhead;
+	uint16_t proto = ETH_P_IP;
+	uint32_t mtu_len = 0;
+	encap_gre_t *encap_gre;
+
+	metrics->forwarded_packets_total_gre++;
+
+	/* Loop protection: the inner packet's TTL is decremented as a safeguard
+	 * against any forwarding loop. As the only interesting field is the TTL
+	 * hop limit for IPv6, it is easier to use bpf_skb_load_bytes/bpf_skb_store_bytes
+	 * as they handle the split packets if needed (no need for the data to be
+	 * in the linear section).
+	 */
+	if (encap->gue.proto_ctype == IPPROTO_IPV6) {
+		proto = ETH_P_IPV6;
+		uint8_t ttl;
+		int rc;
+
+		rc = bpf_skb_load_bytes(
+			skb, payload_off + offsetof(struct ipv6hdr, hop_limit),
+			&ttl, 1);
+		if (rc != 0) {
+			metrics->errors_total_malformed_encapsulation++;
+			return TC_ACT_SHOT;
+		}
+
+		if (ttl == 0) {
+			metrics->errors_total_redirect_loop++;
+			return TC_ACT_SHOT;
+		}
+
+		ttl--;
+		rc = bpf_skb_store_bytes(
+			skb, payload_off + offsetof(struct ipv6hdr, hop_limit),
+			&ttl, 1, 0);
+		if (rc != 0) {
+			metrics->errors_total_malformed_encapsulation++;
+			return TC_ACT_SHOT;
+		}
+	} else {
+		uint8_t ttl;
+		int rc;
+
+		rc = bpf_skb_load_bytes(
+			skb, payload_off + offsetof(struct iphdr, ttl), &ttl,
+			1);
+		if (rc != 0) {
+			metrics->errors_total_malformed_encapsulation++;
+			return TC_ACT_SHOT;
+		}
+
+		if (ttl == 0) {
+			metrics->errors_total_redirect_loop++;
+			return TC_ACT_SHOT;
+		}
+
+		/* IPv4 also has a checksum to patch. While the TTL is only one byte,
+		 * this function only works for 2 and 4 bytes arguments (the result is
+		 * the same).
+		 */
+		rc = bpf_l3_csum_replace(
+			skb, payload_off + offsetof(struct iphdr, check), ttl,
+			ttl - 1, 2);
+		if (rc != 0) {
+			metrics->errors_total_malformed_encapsulation++;
+			return TC_ACT_SHOT;
+		}
+
+		ttl--;
+		rc = bpf_skb_store_bytes(
+			skb, payload_off + offsetof(struct iphdr, ttl), &ttl, 1,
+			0);
+		if (rc != 0) {
+			metrics->errors_total_malformed_encapsulation++;
+			return TC_ACT_SHOT;
+		}
+	}
+
+	if (bpf_check_mtu(skb, skb->ifindex, &mtu_len, delta, 0)) {
+		metrics->errors_total_encap_mtu_violate++;
+		return TC_ACT_SHOT;
+	}
+
+	if (bpf_skb_adjust_room(skb, delta, BPF_ADJ_ROOM_NET,
+				BPF_F_ADJ_ROOM_FIXED_GSO |
+				BPF_F_ADJ_ROOM_NO_CSUM_RESET) ||
+	    bpf_csum_level(skb, BPF_CSUM_LEVEL_INC)) {
+		metrics->errors_total_encap_adjust_failed++;
+		return TC_ACT_SHOT;
+	}
+
+	if (bpf_skb_pull_data(skb, sizeof(encap_gre_t))) {
+		metrics->errors_total_encap_buffer_too_small++;
+		return TC_ACT_SHOT;
+	}
+
+	encap_gre = bpf_dynptr_data(dynptr, 0, sizeof(encap_gre_t));
+	if (!encap_gre) {
+		metrics->errors_total_encap_buffer_too_small++;
+		return TC_ACT_SHOT;
+	}
+
+	encap_gre->ip.protocol = IPPROTO_GRE;
+	encap_gre->ip.daddr = next_hop->s_addr;
+	encap_gre->ip.saddr = ENCAPSULATION_IP;
+	encap_gre->ip.tot_len =
+		bpf_htons(bpf_ntohs(encap_gre->ip.tot_len) + delta);
+	encap_gre->gre.flags = 0;
+	encap_gre->gre.protocol = bpf_htons(proto);
+	pkt_ipv4_checksum((void *)&encap_gre->ip);
+
+	return bpf_redirect(skb->ifindex, 0);
+}
+
+static ret_t forward_to_next_hop(struct __sk_buff *skb, struct bpf_dynptr *dynptr,
+				 encap_headers_t *encap, struct in_addr *next_hop,
+				 metrics_t *metrics)
+{
+	/* swap L2 addresses */
+	/* This assumes that packets are received from a router.
+	 * So just swapping the MAC addresses here will make the packet go back to
+	 * the router, which will send it to the appropriate machine.
+	 */
+	unsigned char temp[ETH_ALEN];
+	memcpy(temp, encap->eth.h_dest, sizeof(temp));
+	memcpy(encap->eth.h_dest, encap->eth.h_source,
+	       sizeof(encap->eth.h_dest));
+	memcpy(encap->eth.h_source, temp, sizeof(encap->eth.h_source));
+
+	if (encap->unigue.next_hop == encap->unigue.hop_count - 1 &&
+	    encap->unigue.last_hop_gre) {
+		return forward_with_gre(skb, dynptr, encap, next_hop, metrics);
+	}
+
+	metrics->forwarded_packets_total_gue++;
+	uint32_t old_saddr = encap->ip.saddr;
+	encap->ip.saddr = encap->ip.daddr;
+	encap->ip.daddr = next_hop->s_addr;
+	if (encap->unigue.next_hop < encap->unigue.hop_count) {
+		encap->unigue.next_hop++;
+	}
+
+	/* Remove ip->saddr, add next_hop->s_addr */
+	const uint64_t off = offsetof(typeof(*encap), ip.check);
+	int ret = bpf_l3_csum_replace(skb, off, old_saddr, next_hop->s_addr, 4);
+	if (ret < 0) {
+		return TC_ACT_SHOT;
+	}
+
+	return bpf_redirect(skb->ifindex, 0);
+}
+
+static ret_t skip_next_hops(__u64 *offset, int n)
+{
+	__u32 res;
+	switch (n) {
+	case 1:
+		*offset += sizeof(struct in_addr);
+	case 0:
+		return CONTINUE_PROCESSING;
+
+	default:
+		return TC_ACT_SHOT;
+	}
+}
+
+/* Get the next hop from the GLB header.
+ *
+ * Sets next_hop->s_addr to 0 if there are no more hops left.
+ * pkt is positioned just after the variable length GLB header
+ * iff the call is successful.
+ */
+static ret_t get_next_hop(struct bpf_dynptr *dynptr, __u64 *offset, encap_headers_t *encap,
+			  struct in_addr *next_hop)
+{
+	if (encap->unigue.next_hop > encap->unigue.hop_count)
+		return TC_ACT_SHOT;
+
+	/* Skip "used" next hops. */
+	MAYBE_RETURN(skip_next_hops(offset, encap->unigue.next_hop));
+
+	if (encap->unigue.next_hop == encap->unigue.hop_count) {
+		/* No more next hops, we are at the end of the GLB header. */
+		next_hop->s_addr = 0;
+		return CONTINUE_PROCESSING;
+	}
+
+	if (bpf_dynptr_read(next_hop, sizeof(*next_hop), dynptr, *offset, 0))
+		return TC_ACT_SHOT;
+
+	*offset += sizeof(*next_hop);
+
+	/* Skip the remainig next hops (may be zero). */
+	return skip_next_hops(offset, encap->unigue.hop_count - encap->unigue.next_hop - 1);
+}
+
+/* Fill a bpf_sock_tuple to be used with the socket lookup functions.
+ * This is a kludge that let's us work around verifier limitations:
+ *
+ *    fill_tuple(&t, foo, sizeof(struct iphdr), 123, 321)
+ *
+ * clang will substitue a costant for sizeof, which allows the verifier
+ * to track it's value. Based on this, it can figure out the constant
+ * return value, and calling code works while still being "generic" to
+ * IPv4 and IPv6.
+ */
+static uint64_t fill_tuple(struct bpf_sock_tuple *tuple, void *iph,
+				    uint64_t iphlen, uint16_t sport, uint16_t dport)
+{
+	switch (iphlen) {
+	case sizeof(struct iphdr): {
+		struct iphdr *ipv4 = (struct iphdr *)iph;
+		tuple->ipv4.daddr = ipv4->daddr;
+		tuple->ipv4.saddr = ipv4->saddr;
+		tuple->ipv4.sport = sport;
+		tuple->ipv4.dport = dport;
+		return sizeof(tuple->ipv4);
+	}
+
+	case sizeof(struct ipv6hdr): {
+		struct ipv6hdr *ipv6 = (struct ipv6hdr *)iph;
+		memcpy(&tuple->ipv6.daddr, &ipv6->daddr,
+		       sizeof(tuple->ipv6.daddr));
+		memcpy(&tuple->ipv6.saddr, &ipv6->saddr,
+		       sizeof(tuple->ipv6.saddr));
+		tuple->ipv6.sport = sport;
+		tuple->ipv6.dport = dport;
+		return sizeof(tuple->ipv6);
+	}
+
+	default:
+		return 0;
+	}
+}
+
+static verdict_t classify_tcp(struct __sk_buff *skb, struct bpf_sock_tuple *tuple,
+			      uint64_t tuplen, void *iph, struct tcphdr *tcp)
+{
+	struct bpf_sock *sk =
+		bpf_skc_lookup_tcp(skb, tuple, tuplen, BPF_F_CURRENT_NETNS, 0);
+
+	if (sk == NULL)
+		return UNKNOWN;
+
+	if (sk->state != BPF_TCP_LISTEN) {
+		bpf_sk_release(sk);
+		return ESTABLISHED;
+	}
+
+	if (iph != NULL && tcp != NULL) {
+		/* Kludge: we've run out of arguments, but need the length of the ip header. */
+		uint64_t iphlen = sizeof(struct iphdr);
+
+		if (tuplen == sizeof(tuple->ipv6))
+			iphlen = sizeof(struct ipv6hdr);
+
+		if (bpf_tcp_check_syncookie(sk, iph, iphlen, tcp,
+					    sizeof(*tcp)) == 0) {
+			bpf_sk_release(sk);
+			return SYN_COOKIE;
+		}
+	}
+
+	bpf_sk_release(sk);
+	return UNKNOWN;
+}
+
+static verdict_t classify_udp(struct __sk_buff *skb, struct bpf_sock_tuple *tuple, uint64_t tuplen)
+{
+	struct bpf_sock *sk =
+		bpf_sk_lookup_udp(skb, tuple, tuplen, BPF_F_CURRENT_NETNS, 0);
+
+	if (sk == NULL)
+		return UNKNOWN;
+
+	if (sk->state == BPF_TCP_ESTABLISHED) {
+		bpf_sk_release(sk);
+		return ESTABLISHED;
+	}
+
+	bpf_sk_release(sk);
+	return UNKNOWN;
+}
+
+static verdict_t classify_icmp(struct __sk_buff *skb, uint8_t proto, struct bpf_sock_tuple *tuple,
+			       uint64_t tuplen, metrics_t *metrics)
+{
+	switch (proto) {
+	case IPPROTO_TCP:
+		return classify_tcp(skb, tuple, tuplen, NULL, NULL);
+
+	case IPPROTO_UDP:
+		return classify_udp(skb, tuple, tuplen);
+
+	default:
+		metrics->errors_total_malformed_icmp++;
+		return INVALID;
+	}
+}
+
+static verdict_t process_icmpv4(struct __sk_buff *skb, struct bpf_dynptr *dynptr, __u64 *offset,
+				metrics_t *metrics)
+{
+	struct icmphdr icmp;
+	struct iphdr ipv4;
+
+	if (bpf_dynptr_read(&icmp, sizeof(icmp), dynptr, *offset, 0)) {
+		metrics->errors_total_malformed_icmp++;
+		return INVALID;
+	}
+
+	*offset += sizeof(icmp);
+
+	/* We should never receive encapsulated echo replies. */
+	if (icmp.type == ICMP_ECHOREPLY) {
+		metrics->errors_total_icmp_echo_replies++;
+		return INVALID;
+	}
+
+	if (icmp.type == ICMP_ECHO)
+		return ECHO_REQUEST;
+
+	if (icmp.type != ICMP_DEST_UNREACH || icmp.code != ICMP_FRAG_NEEDED) {
+		metrics->errors_total_unwanted_icmp++;
+		return INVALID;
+	}
+
+	if (pkt_parse_ipv4(dynptr, offset, &ipv4)) {
+		metrics->errors_total_malformed_icmp_pkt_too_big++;
+		return INVALID;
+	}
+
+	/* The source address in the outer IP header is from the entity that
+	 * originated the ICMP message. Use the original IP header to restore
+	 * the correct flow tuple.
+	 */
+	struct bpf_sock_tuple tuple;
+	tuple.ipv4.saddr = ipv4.daddr;
+	tuple.ipv4.daddr = ipv4.saddr;
+
+	if (!pkt_parse_icmp_l4_ports(dynptr, offset, (flow_ports_t *)&tuple.ipv4.sport)) {
+		metrics->errors_total_malformed_icmp_pkt_too_big++;
+		return INVALID;
+	}
+
+	return classify_icmp(skb, ipv4.protocol, &tuple,
+			     sizeof(tuple.ipv4), metrics);
+}
+
+static verdict_t process_icmpv6(struct bpf_dynptr *dynptr, __u64 *offset, struct __sk_buff *skb,
+				metrics_t *metrics)
+{
+	struct bpf_sock_tuple tuple;
+	struct ipv6hdr ipv6;
+	struct icmp6hdr icmp6;
+	bool is_fragment;
+	uint8_t l4_proto;
+
+	if (bpf_dynptr_read(&icmp6, sizeof(icmp6), dynptr, *offset, 0)) {
+		metrics->errors_total_malformed_icmp++;
+		return INVALID;
+	}
+
+	/* We should never receive encapsulated echo replies. */
+	if (icmp6.icmp6_type == ICMPV6_ECHO_REPLY) {
+		metrics->errors_total_icmp_echo_replies++;
+		return INVALID;
+	}
+
+	if (icmp6.icmp6_type == ICMPV6_ECHO_REQUEST) {
+		return ECHO_REQUEST;
+	}
+
+	if (icmp6.icmp6_type != ICMPV6_PKT_TOOBIG) {
+		metrics->errors_total_unwanted_icmp++;
+		return INVALID;
+	}
+
+	if (pkt_parse_ipv6(dynptr, offset, &ipv6, &l4_proto, &is_fragment)) {
+		metrics->errors_total_malformed_icmp_pkt_too_big++;
+		return INVALID;
+	}
+
+	if (is_fragment) {
+		metrics->errors_total_fragmented_ip++;
+		return INVALID;
+	}
+
+	/* Swap source and dest addresses. */
+	memcpy(&tuple.ipv6.saddr, &ipv6.daddr, sizeof(tuple.ipv6.saddr));
+	memcpy(&tuple.ipv6.daddr, &ipv6.saddr, sizeof(tuple.ipv6.daddr));
+
+	if (!pkt_parse_icmp_l4_ports(dynptr, offset, (flow_ports_t *)&tuple.ipv6.sport)) {
+		metrics->errors_total_malformed_icmp_pkt_too_big++;
+		return INVALID;
+	}
+
+	return classify_icmp(skb, l4_proto, &tuple, sizeof(tuple.ipv6),
+			     metrics);
+}
+
+static verdict_t process_tcp(struct bpf_dynptr *dynptr, __u64 *offset, struct __sk_buff *skb,
+			     struct iphdr_info *info, metrics_t *metrics)
+{
+	struct bpf_sock_tuple tuple;
+	struct tcphdr tcp;
+	uint64_t tuplen;
+
+	metrics->l4_protocol_packets_total_tcp++;
+
+	if (bpf_dynptr_read(&tcp, sizeof(tcp), dynptr, *offset, 0)) {
+		metrics->errors_total_malformed_tcp++;
+		return INVALID;
+	}
+
+	*offset += sizeof(tcp);
+
+	if (tcp.syn)
+		return SYN;
+
+	tuplen = fill_tuple(&tuple, info->hdr, info->len, tcp.source, tcp.dest);
+	return classify_tcp(skb, &tuple, tuplen, info->hdr, &tcp);
+}
+
+static verdict_t process_udp(struct bpf_dynptr *dynptr, __u64 *offset, struct __sk_buff *skb,
+			     struct iphdr_info *info, metrics_t *metrics)
+{
+	struct bpf_sock_tuple tuple;
+	struct udphdr udph;
+	uint64_t tuplen;
+
+	metrics->l4_protocol_packets_total_udp++;
+
+	if (bpf_dynptr_read(&udph, sizeof(udph), dynptr, *offset, 0)) {
+		metrics->errors_total_malformed_udp++;
+		return INVALID;
+	}
+	*offset += sizeof(udph);
+
+	tuplen = fill_tuple(&tuple, info->hdr, info->len, udph.source, udph.dest);
+	return classify_udp(skb, &tuple, tuplen);
+}
+
+static verdict_t process_ipv4(struct __sk_buff *skb, struct bpf_dynptr *dynptr,
+			      __u64 *offset, metrics_t *metrics)
+{
+	struct iphdr ipv4;
+	struct iphdr_info info = {
+		.hdr = &ipv4,
+		.len = sizeof(ipv4),
+	};
+
+	metrics->l3_protocol_packets_total_ipv4++;
+
+	if (pkt_parse_ipv4(dynptr, offset, &ipv4)) {
+		metrics->errors_total_malformed_ip++;
+		return INVALID;
+	}
+
+	if (ipv4.version != 4) {
+		metrics->errors_total_malformed_ip++;
+		return INVALID;
+	}
+
+	if (ipv4_is_fragment(&ipv4)) {
+		metrics->errors_total_fragmented_ip++;
+		return INVALID;
+	}
+
+	switch (ipv4.protocol) {
+	case IPPROTO_ICMP:
+		return process_icmpv4(skb, dynptr, offset, metrics);
+
+	case IPPROTO_TCP:
+		return process_tcp(dynptr, offset, skb, &info, metrics);
+
+	case IPPROTO_UDP:
+		return process_udp(dynptr, offset, skb, &info, metrics);
+
+	default:
+		metrics->errors_total_unknown_l4_proto++;
+		return INVALID;
+	}
+}
+
+static verdict_t process_ipv6(struct __sk_buff *skb, struct bpf_dynptr *dynptr,
+			      __u64 *offset, metrics_t *metrics)
+{
+	struct ipv6hdr ipv6;
+	struct iphdr_info info = {
+		.hdr = &ipv6,
+		.len = sizeof(ipv6),
+	};
+	uint8_t l4_proto;
+	bool is_fragment;
+
+	metrics->l3_protocol_packets_total_ipv6++;
+
+	if (pkt_parse_ipv6(dynptr, offset, &ipv6, &l4_proto, &is_fragment)) {
+		metrics->errors_total_malformed_ip++;
+		return INVALID;
+	}
+
+	if (ipv6.version != 6) {
+		metrics->errors_total_malformed_ip++;
+		return INVALID;
+	}
+
+	if (is_fragment) {
+		metrics->errors_total_fragmented_ip++;
+		return INVALID;
+	}
+
+	switch (l4_proto) {
+	case IPPROTO_ICMPV6:
+		return process_icmpv6(dynptr, offset, skb, metrics);
+
+	case IPPROTO_TCP:
+		return process_tcp(dynptr, offset, skb, &info, metrics);
+
+	case IPPROTO_UDP:
+		return process_udp(dynptr, offset, skb, &info, metrics);
+
+	default:
+		metrics->errors_total_unknown_l4_proto++;
+		return INVALID;
+	}
+}
+
+SEC("tc")
+int cls_redirect(struct __sk_buff *skb)
+{
+	struct bpf_dynptr dynptr;
+	struct in_addr next_hop;
+	/* Tracks offset of the dynptr. This will be unnecessary once
+	 * bpf_dynptr_advance() is available.
+	 */
+	__u64 off = 0;
+
+	bpf_dynptr_from_skb(skb, 0, &dynptr);
+
+	metrics_t *metrics = get_global_metrics();
+	if (metrics == NULL)
+		return TC_ACT_SHOT;
+
+	metrics->processed_packets_total++;
+
+	/* Pass bogus packets as long as we're not sure they're
+	 * destined for us.
+	 */
+	if (skb->protocol != bpf_htons(ETH_P_IP))
+		return TC_ACT_OK;
+
+	encap_headers_t *encap;
+
+	/* Make sure that all encapsulation headers are available in
+	 * the linear portion of the skb. This makes it easy to manipulate them.
+	 */
+	if (bpf_skb_pull_data(skb, sizeof(*encap)))
+		return TC_ACT_OK;
+
+	encap = bpf_dynptr_data(&dynptr, 0, sizeof(*encap));
+	if (!encap)
+		return TC_ACT_OK;
+
+	off += sizeof(*encap);
+
+	if (encap->ip.ihl != 5)
+		/* We never have any options. */
+		return TC_ACT_OK;
+
+	if (encap->ip.daddr != ENCAPSULATION_IP ||
+	    encap->ip.protocol != IPPROTO_UDP)
+		return TC_ACT_OK;
+
+	/* TODO Check UDP length? */
+	if (encap->udp.dest != ENCAPSULATION_PORT)
+		return TC_ACT_OK;
+
+	/* We now know that the packet is destined to us, we can
+	 * drop bogus ones.
+	 */
+	if (ipv4_is_fragment((void *)&encap->ip)) {
+		metrics->errors_total_fragmented_ip++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->gue.variant != 0) {
+		metrics->errors_total_malformed_encapsulation++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->gue.control != 0) {
+		metrics->errors_total_malformed_encapsulation++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->gue.flags != 0) {
+		metrics->errors_total_malformed_encapsulation++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->gue.hlen !=
+	    sizeof(encap->unigue) / 4 + encap->unigue.hop_count) {
+		metrics->errors_total_malformed_encapsulation++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->unigue.version != 0) {
+		metrics->errors_total_malformed_encapsulation++;
+		return TC_ACT_SHOT;
+	}
+
+	if (encap->unigue.reserved != 0)
+		return TC_ACT_SHOT;
+
+	MAYBE_RETURN(get_next_hop(&dynptr, &off, encap, &next_hop));
+
+	if (next_hop.s_addr == 0) {
+		metrics->accepted_packets_total_last_hop++;
+		return accept_locally(skb, encap);
+	}
+
+	verdict_t verdict;
+	switch (encap->gue.proto_ctype) {
+	case IPPROTO_IPIP:
+		verdict = process_ipv4(skb, &dynptr, &off, metrics);
+		break;
+
+	case IPPROTO_IPV6:
+		verdict = process_ipv6(skb, &dynptr, &off, metrics);
+		break;
+
+	default:
+		metrics->errors_total_unknown_l3_proto++;
+		return TC_ACT_SHOT;
+	}
+
+	switch (verdict) {
+	case INVALID:
+		/* metrics have already been bumped */
+		return TC_ACT_SHOT;
+
+	case UNKNOWN:
+		return forward_to_next_hop(skb, &dynptr, encap, &next_hop, metrics);
+
+	case ECHO_REQUEST:
+		metrics->accepted_packets_total_icmp_echo_request++;
+		break;
+
+	case SYN:
+		if (encap->unigue.forward_syn) {
+			return forward_to_next_hop(skb, &dynptr, encap, &next_hop,
+						   metrics);
+		}
+
+		metrics->accepted_packets_total_syn++;
+		break;
+
+	case SYN_COOKIE:
+		metrics->accepted_packets_total_syn_cookies++;
+		break;
+
+	case ESTABLISHED:
+		metrics->accepted_packets_total_established++;
+		break;
+	}
+
+	return accept_locally(skb, encap);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c b/tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c
new file mode 100644
index 000000000000..714b99e2d8b6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c
@@ -0,0 +1,468 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (c) 2017 Facebook
+#include <stddef.h>
+#include <stdbool.h>
+#include <string.h>
+#include <linux/pkt_cls.h>
+#include <linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/icmp.h>
+#include <linux/icmpv6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <bpf/bpf_helpers.h>
+#include "test_iptunnel_common.h"
+#include <bpf/bpf_endian.h>
+
+static __always_inline __u32 rol32(__u32 word, unsigned int shift)
+{
+	return (word << shift) | (word >> ((-shift) & 31));
+}
+
+/* copy paste of jhash from kernel sources to make sure llvm
+ * can compile it into valid sequence of bpf instructions
+ */
+#define __jhash_mix(a, b, c)			\
+{						\
+	a -= c;  a ^= rol32(c, 4);  c += b;	\
+	b -= a;  b ^= rol32(a, 6);  a += c;	\
+	c -= b;  c ^= rol32(b, 8);  b += a;	\
+	a -= c;  a ^= rol32(c, 16); c += b;	\
+	b -= a;  b ^= rol32(a, 19); a += c;	\
+	c -= b;  c ^= rol32(b, 4);  b += a;	\
+}
+
+#define __jhash_final(a, b, c)			\
+{						\
+	c ^= b; c -= rol32(b, 14);		\
+	a ^= c; a -= rol32(c, 11);		\
+	b ^= a; b -= rol32(a, 25);		\
+	c ^= b; c -= rol32(b, 16);		\
+	a ^= c; a -= rol32(c, 4);		\
+	b ^= a; b -= rol32(a, 14);		\
+	c ^= b; c -= rol32(b, 24);		\
+}
+
+#define JHASH_INITVAL		0xdeadbeef
+
+typedef unsigned int u32;
+
+static __noinline u32 jhash(const void *key, u32 length, u32 initval)
+{
+	u32 a, b, c;
+	const unsigned char *k = key;
+
+	a = b = c = JHASH_INITVAL + length + initval;
+
+	while (length > 12) {
+		a += *(u32 *)(k);
+		b += *(u32 *)(k + 4);
+		c += *(u32 *)(k + 8);
+		__jhash_mix(a, b, c);
+		length -= 12;
+		k += 12;
+	}
+	switch (length) {
+	case 12: c += (u32)k[11]<<24;
+	case 11: c += (u32)k[10]<<16;
+	case 10: c += (u32)k[9]<<8;
+	case 9:  c += k[8];
+	case 8:  b += (u32)k[7]<<24;
+	case 7:  b += (u32)k[6]<<16;
+	case 6:  b += (u32)k[5]<<8;
+	case 5:  b += k[4];
+	case 4:  a += (u32)k[3]<<24;
+	case 3:  a += (u32)k[2]<<16;
+	case 2:  a += (u32)k[1]<<8;
+	case 1:  a += k[0];
+		 __jhash_final(a, b, c);
+	case 0: /* Nothing left to add */
+		break;
+	}
+
+	return c;
+}
+
+static __noinline u32 __jhash_nwords(u32 a, u32 b, u32 c, u32 initval)
+{
+	a += initval;
+	b += initval;
+	c += initval;
+	__jhash_final(a, b, c);
+	return c;
+}
+
+static __noinline u32 jhash_2words(u32 a, u32 b, u32 initval)
+{
+	return __jhash_nwords(a, b, 0, initval + JHASH_INITVAL + (2 << 2));
+}
+
+#define PCKT_FRAGMENTED 65343
+#define IPV4_HDR_LEN_NO_OPT 20
+#define IPV4_PLUS_ICMP_HDR 28
+#define IPV6_PLUS_ICMP_HDR 48
+#define RING_SIZE 2
+#define MAX_VIPS 12
+#define MAX_REALS 5
+#define CTL_MAP_SIZE 16
+#define CH_RINGS_SIZE (MAX_VIPS * RING_SIZE)
+#define F_IPV6 (1 << 0)
+#define F_HASH_NO_SRC_PORT (1 << 0)
+#define F_ICMP (1 << 0)
+#define F_SYN_SET (1 << 1)
+
+struct packet_description {
+	union {
+		__be32 src;
+		__be32 srcv6[4];
+	};
+	union {
+		__be32 dst;
+		__be32 dstv6[4];
+	};
+	union {
+		__u32 ports;
+		__u16 port16[2];
+	};
+	__u8 proto;
+	__u8 flags;
+};
+
+struct ctl_value {
+	union {
+		__u64 value;
+		__u32 ifindex;
+		__u8 mac[6];
+	};
+};
+
+struct vip_meta {
+	__u32 flags;
+	__u32 vip_num;
+};
+
+struct real_definition {
+	union {
+		__be32 dst;
+		__be32 dstv6[4];
+	};
+	__u8 flags;
+};
+
+struct vip_stats {
+	__u64 bytes;
+	__u64 pkts;
+};
+
+struct eth_hdr {
+	unsigned char eth_dest[ETH_ALEN];
+	unsigned char eth_source[ETH_ALEN];
+	unsigned short eth_proto;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_VIPS);
+	__type(key, struct vip);
+	__type(value, struct vip_meta);
+} vip_map SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, CH_RINGS_SIZE);
+	__type(key, __u32);
+	__type(value, __u32);
+} ch_rings SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, MAX_REALS);
+	__type(key, __u32);
+	__type(value, struct real_definition);
+} reals SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(max_entries, MAX_VIPS);
+	__type(key, __u32);
+	__type(value, struct vip_stats);
+} stats SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, CTL_MAP_SIZE);
+	__type(key, __u32);
+	__type(value, struct ctl_value);
+} ctl_array SEC(".maps");
+
+static __noinline __u32 get_packet_hash(struct packet_description *pckt, bool ipv6)
+{
+	if (ipv6)
+		return jhash_2words(jhash(pckt->srcv6, 16, MAX_VIPS),
+				    pckt->ports, CH_RINGS_SIZE);
+	else
+		return jhash_2words(pckt->src, pckt->ports, CH_RINGS_SIZE);
+}
+
+static __noinline bool get_packet_dst(struct real_definition **real,
+				      struct packet_description *pckt,
+				      struct vip_meta *vip_info,
+				      bool is_ipv6)
+{
+	__u32 hash = get_packet_hash(pckt, is_ipv6);
+	__u32 key = RING_SIZE * vip_info->vip_num + hash % RING_SIZE;
+	__u32 *real_pos;
+
+	if (hash != 0x358459b7 /* jhash of ipv4 packet */  &&
+	    hash != 0x2f4bc6bb /* jhash of ipv6 packet */)
+		return false;
+
+	real_pos = bpf_map_lookup_elem(&ch_rings, &key);
+	if (!real_pos)
+		return false;
+	key = *real_pos;
+	*real = bpf_map_lookup_elem(&reals, &key);
+	if (!(*real))
+		return false;
+	return true;
+}
+
+static __noinline int parse_icmpv6(struct bpf_dynptr *skb_ptr, __u64 off,
+				   struct packet_description *pckt)
+{
+	struct icmp6hdr *icmp_hdr;
+	struct ipv6hdr *ip6h;
+
+	icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr));
+	if (!icmp_hdr)
+		return TC_ACT_SHOT;
+
+	if (icmp_hdr->icmp6_type != ICMPV6_PKT_TOOBIG)
+		return TC_ACT_OK;
+	off += sizeof(struct icmp6hdr);
+	ip6h = (struct ipv6hdr *)bpf_dynptr_data(skb_ptr, off, sizeof(*ip6h));
+	if (!ip6h)
+		return TC_ACT_SHOT;
+	pckt->proto = ip6h->nexthdr;
+	pckt->flags |= F_ICMP;
+	memcpy(pckt->srcv6, ip6h->daddr.s6_addr32, 16);
+	memcpy(pckt->dstv6, ip6h->saddr.s6_addr32, 16);
+	return TC_ACT_UNSPEC;
+}
+
+static __noinline int parse_icmp(struct bpf_dynptr *skb_ptr, __u64 off,
+				 struct packet_description *pckt)
+{
+	struct icmphdr *icmp_hdr;
+	struct iphdr *iph;
+
+	icmp_hdr = bpf_dynptr_data(skb_ptr, off, sizeof(*icmp_hdr));
+	if (!icmp_hdr)
+		return TC_ACT_SHOT;
+	if (icmp_hdr->type != ICMP_DEST_UNREACH ||
+	    icmp_hdr->code != ICMP_FRAG_NEEDED)
+		return TC_ACT_OK;
+	off += sizeof(struct icmphdr);
+	iph = bpf_dynptr_data(skb_ptr, off, sizeof(*iph));
+	if (!iph || iph->ihl != 5)
+		return TC_ACT_SHOT;
+	pckt->proto = iph->protocol;
+	pckt->flags |= F_ICMP;
+	pckt->src = iph->daddr;
+	pckt->dst = iph->saddr;
+	return TC_ACT_UNSPEC;
+}
+
+static __noinline bool parse_udp(struct bpf_dynptr *skb_ptr, __u64 off,
+				 struct packet_description *pckt)
+{
+	struct udphdr *udp;
+
+	udp = bpf_dynptr_data(skb_ptr, off, sizeof(*udp));
+	if (!udp)
+		return false;
+
+	if (!(pckt->flags & F_ICMP)) {
+		pckt->port16[0] = udp->source;
+		pckt->port16[1] = udp->dest;
+	} else {
+		pckt->port16[0] = udp->dest;
+		pckt->port16[1] = udp->source;
+	}
+	return true;
+}
+
+static __noinline bool parse_tcp(struct bpf_dynptr *skb_ptr, __u64 off,
+				 struct packet_description *pckt)
+{
+	struct tcphdr *tcp;
+
+	tcp = bpf_dynptr_data(skb_ptr, off, sizeof(*tcp));
+	if (!tcp)
+		return false;
+
+	if (tcp->syn)
+		pckt->flags |= F_SYN_SET;
+
+	if (!(pckt->flags & F_ICMP)) {
+		pckt->port16[0] = tcp->source;
+		pckt->port16[1] = tcp->dest;
+	} else {
+		pckt->port16[0] = tcp->dest;
+		pckt->port16[1] = tcp->source;
+	}
+	return true;
+}
+
+static __noinline int process_packet(struct bpf_dynptr *skb_ptr,
+				     struct eth_hdr *eth, __u64 off,
+				     bool is_ipv6, struct __sk_buff *skb)
+{
+	struct packet_description pckt = {};
+	struct bpf_tunnel_key tkey = {};
+	struct vip_stats *data_stats;
+	struct real_definition *dst;
+	struct vip_meta *vip_info;
+	struct ctl_value *cval;
+	__u32 v4_intf_pos = 1;
+	__u32 v6_intf_pos = 2;
+	struct ipv6hdr *ip6h;
+	struct vip vip = {};
+	struct iphdr *iph;
+	int tun_flag = 0;
+	__u16 pkt_bytes;
+	__u64 iph_len;
+	__u32 ifindex;
+	__u8 protocol;
+	__u32 vip_num;
+	int action;
+
+	tkey.tunnel_ttl = 64;
+	if (is_ipv6) {
+		ip6h = bpf_dynptr_data(skb_ptr, off, sizeof(*ip6h));
+		if (!ip6h)
+			return TC_ACT_SHOT;
+
+		iph_len = sizeof(struct ipv6hdr);
+		protocol = ip6h->nexthdr;
+		pckt.proto = protocol;
+		pkt_bytes = bpf_ntohs(ip6h->payload_len);
+		off += iph_len;
+		if (protocol == IPPROTO_FRAGMENT) {
+			return TC_ACT_SHOT;
+		} else if (protocol == IPPROTO_ICMPV6) {
+			action = parse_icmpv6(skb_ptr, off, &pckt);
+			if (action >= 0)
+				return action;
+			off += IPV6_PLUS_ICMP_HDR;
+		} else {
+			memcpy(pckt.srcv6, ip6h->saddr.s6_addr32, 16);
+			memcpy(pckt.dstv6, ip6h->daddr.s6_addr32, 16);
+		}
+	} else {
+		iph = bpf_dynptr_data(skb_ptr, off, sizeof(*iph));
+		if (!iph || iph->ihl != 5)
+			return TC_ACT_SHOT;
+
+		protocol = iph->protocol;
+		pckt.proto = protocol;
+		pkt_bytes = bpf_ntohs(iph->tot_len);
+		off += IPV4_HDR_LEN_NO_OPT;
+
+		if (iph->frag_off & PCKT_FRAGMENTED)
+			return TC_ACT_SHOT;
+		if (protocol == IPPROTO_ICMP) {
+			action = parse_icmp(skb_ptr, off, &pckt);
+			if (action >= 0)
+				return action;
+			off += IPV4_PLUS_ICMP_HDR;
+		} else {
+			pckt.src = iph->saddr;
+			pckt.dst = iph->daddr;
+		}
+	}
+	protocol = pckt.proto;
+
+	if (protocol == IPPROTO_TCP) {
+		if (!parse_tcp(skb_ptr, off, &pckt))
+			return TC_ACT_SHOT;
+	} else if (protocol == IPPROTO_UDP) {
+		if (!parse_udp(skb_ptr, off, &pckt))
+			return TC_ACT_SHOT;
+	} else {
+		return TC_ACT_SHOT;
+	}
+
+	if (is_ipv6)
+		memcpy(vip.daddr.v6, pckt.dstv6, 16);
+	else
+		vip.daddr.v4 = pckt.dst;
+
+	vip.dport = pckt.port16[1];
+	vip.protocol = pckt.proto;
+	vip_info = bpf_map_lookup_elem(&vip_map, &vip);
+	if (!vip_info) {
+		vip.dport = 0;
+		vip_info = bpf_map_lookup_elem(&vip_map, &vip);
+		if (!vip_info)
+			return TC_ACT_SHOT;
+		pckt.port16[1] = 0;
+	}
+
+	if (vip_info->flags & F_HASH_NO_SRC_PORT)
+		pckt.port16[0] = 0;
+
+	if (!get_packet_dst(&dst, &pckt, vip_info, is_ipv6))
+		return TC_ACT_SHOT;
+
+	if (dst->flags & F_IPV6) {
+		cval = bpf_map_lookup_elem(&ctl_array, &v6_intf_pos);
+		if (!cval)
+			return TC_ACT_SHOT;
+		ifindex = cval->ifindex;
+		memcpy(tkey.remote_ipv6, dst->dstv6, 16);
+		tun_flag = BPF_F_TUNINFO_IPV6;
+	} else {
+		cval = bpf_map_lookup_elem(&ctl_array, &v4_intf_pos);
+		if (!cval)
+			return TC_ACT_SHOT;
+		ifindex = cval->ifindex;
+		tkey.remote_ipv4 = dst->dst;
+	}
+	vip_num = vip_info->vip_num;
+	data_stats = bpf_map_lookup_elem(&stats, &vip_num);
+	if (!data_stats)
+		return TC_ACT_SHOT;
+	data_stats->pkts++;
+	data_stats->bytes += pkt_bytes;
+	bpf_skb_set_tunnel_key(skb, &tkey, sizeof(tkey), tun_flag);
+	*(u32 *)eth->eth_dest = tkey.remote_ipv4;
+	return bpf_redirect(ifindex, 0);
+}
+
+SEC("tc")
+int balancer_ingress(struct __sk_buff *ctx)
+{
+	struct bpf_dynptr ptr;
+	struct eth_hdr *eth;
+	__u32 eth_proto;
+	__u32 nh_off;
+
+	nh_off = sizeof(struct eth_hdr);
+
+	bpf_dynptr_from_skb(ctx, 0, &ptr);
+	eth = bpf_dynptr_data(&ptr, 0, sizeof(*eth));
+	if (!eth)
+		return TC_ACT_SHOT;
+	eth_proto = eth->eth_proto;
+	if (eth_proto == bpf_htons(ETH_P_IP))
+		return process_packet(&ptr, eth, nh_off, false, ctx);
+	else if (eth_proto == bpf_htons(ETH_P_IPV6))
+		return process_packet(&ptr, eth, nh_off, true, ctx);
+	else
+		return TC_ACT_SHOT;
+}
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c
new file mode 100644
index 000000000000..79bab9b50e9e
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c
@@ -0,0 +1,119 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* This parsing logic is taken from the open source library katran, a layer 4
+ * load balancer.
+ *
+ * This code logic using dynptrs can be found in test_parse_tcp_hdr_opt_dynptr.c
+ *
+ * https://github.com/facebookincubator/katran/blob/main/katran/lib/bpf/pckt_parsing.h
+ */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <linux/tcp.h>
+#include <stdbool.h>
+#include <linux/ipv6.h>
+#include <linux/if_ether.h>
+#include "test_tcp_hdr_options.h"
+
+char _license[] SEC("license") = "GPL";
+
+/* Kind number used for experiments */
+const __u32 tcp_hdr_opt_kind_tpr = 0xFD;
+/* Length of the tcp header option */
+const __u32 tcp_hdr_opt_len_tpr = 6;
+/* maximum number of header options to check to lookup server_id */
+const __u32 tcp_hdr_opt_max_opt_checks = 15;
+
+__u32 server_id;
+
+struct hdr_opt_state {
+	__u32 server_id;
+	__u8 byte_offset;
+	__u8 hdr_bytes_remaining;
+};
+
+static int parse_hdr_opt(const struct xdp_md *xdp, struct hdr_opt_state *state)
+{
+	const void *data = (void *)(long)xdp->data;
+	const void *data_end = (void *)(long)xdp->data_end;
+	__u8 *tcp_opt, kind, hdr_len;
+
+	tcp_opt = (__u8 *)(data + state->byte_offset);
+	if (tcp_opt + 1 > data_end)
+		return -1;
+
+	kind = tcp_opt[0];
+
+	if (kind == TCPOPT_EOL)
+		return -1;
+
+	if (kind == TCPOPT_NOP) {
+		state->hdr_bytes_remaining--;
+		state->byte_offset++;
+		return 0;
+	}
+
+	if (state->hdr_bytes_remaining < 2 ||
+	    tcp_opt + sizeof(__u8) + sizeof(__u8) > data_end)
+		return -1;
+
+	hdr_len = tcp_opt[1];
+	if (hdr_len > state->hdr_bytes_remaining)
+		return -1;
+
+	if (kind == tcp_hdr_opt_kind_tpr) {
+		if (hdr_len != tcp_hdr_opt_len_tpr)
+			return -1;
+
+		if (tcp_opt + tcp_hdr_opt_len_tpr > data_end)
+			return -1;
+
+		state->server_id = *(__u32 *)&tcp_opt[2];
+		return 1;
+	}
+
+	state->hdr_bytes_remaining -= hdr_len;
+	state->byte_offset += hdr_len;
+	return 0;
+}
+
+SEC("xdp")
+int xdp_ingress_v6(struct xdp_md *xdp)
+{
+	const void *data = (void *)(long)xdp->data;
+	const void *data_end = (void *)(long)xdp->data_end;
+	struct hdr_opt_state opt_state = {};
+	__u8 tcp_hdr_opt_len = 0;
+	struct tcphdr *tcp_hdr;
+	__u64 tcp_offset = 0;
+	__u32 off;
+	int err;
+
+	tcp_offset = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
+	tcp_hdr = (struct tcphdr *)(data + tcp_offset);
+	if (tcp_hdr + 1 > data_end)
+		return XDP_DROP;
+
+	tcp_hdr_opt_len = (tcp_hdr->doff * 4) - sizeof(struct tcphdr);
+	if (tcp_hdr_opt_len < tcp_hdr_opt_len_tpr)
+		return XDP_DROP;
+
+	opt_state.hdr_bytes_remaining = tcp_hdr_opt_len;
+	opt_state.byte_offset = sizeof(struct tcphdr) + tcp_offset;
+
+	/* max number of bytes of options in tcp header is 40 bytes */
+	for (int i = 0; i < tcp_hdr_opt_max_opt_checks; i++) {
+		err = parse_hdr_opt(xdp, &opt_state);
+
+		if (err || !opt_state.hdr_bytes_remaining)
+			break;
+	}
+
+	if (!opt_state.server_id)
+		return XDP_DROP;
+
+	server_id = opt_state.server_id;
+
+	return XDP_PASS;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c
new file mode 100644
index 000000000000..0b97a6708fc5
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* This logic is lifted from a real-world use case of packet parsing, used in
+ * the open source library katran, a layer 4 load balancer.
+ *
+ * This test demonstrates how to parse packet contents using dynptrs. The
+ * original code (parsing without dynptrs) can be found in test_parse_tcp_hdr_opt.c
+ */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+#include <linux/tcp.h>
+#include <stdbool.h>
+#include <linux/ipv6.h>
+#include <linux/if_ether.h>
+#include "test_tcp_hdr_options.h"
+
+char _license[] SEC("license") = "GPL";
+
+/* Kind number used for experiments */
+const __u32 tcp_hdr_opt_kind_tpr = 0xFD;
+/* Length of the tcp header option */
+const __u32 tcp_hdr_opt_len_tpr = 6;
+/* maximum number of header options to check to lookup server_id */
+const __u32 tcp_hdr_opt_max_opt_checks = 15;
+
+__u32 server_id;
+
+static int parse_hdr_opt(struct bpf_dynptr *ptr, __u32 *off, __u8 *hdr_bytes_remaining,
+			 __u32 *server_id)
+{
+	__u8 *tcp_opt, kind, hdr_len;
+	__u8 *data;
+
+	data = bpf_dynptr_data(ptr, *off, sizeof(kind) + sizeof(hdr_len) +
+			       sizeof(*server_id));
+	if (!data)
+		return -1;
+
+	kind = data[0];
+
+	if (kind == TCPOPT_EOL)
+		return -1;
+
+	if (kind == TCPOPT_NOP) {
+		*off += 1;
+		*hdr_bytes_remaining -= 1;
+		return 0;
+	}
+
+	if (*hdr_bytes_remaining < 2)
+		return -1;
+
+	hdr_len = data[1];
+	if (hdr_len > *hdr_bytes_remaining)
+		return -1;
+
+	if (kind == tcp_hdr_opt_kind_tpr) {
+		if (hdr_len != tcp_hdr_opt_len_tpr)
+			return -1;
+
+		__builtin_memcpy(server_id, (__u32 *)(data + 2), sizeof(*server_id));
+		return 1;
+	}
+
+	*off += hdr_len;
+	*hdr_bytes_remaining -= hdr_len;
+	return 0;
+}
+
+SEC("xdp")
+int xdp_ingress_v6(struct xdp_md *xdp)
+{
+	__u8 hdr_bytes_remaining;
+	struct tcphdr *tcp_hdr;
+	__u8 tcp_hdr_opt_len;
+	int err = 0;
+	__u32 off;
+
+	struct bpf_dynptr ptr;
+
+	bpf_dynptr_from_xdp(xdp, 0, &ptr);
+
+	off = sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
+
+	tcp_hdr = bpf_dynptr_data(&ptr, off, sizeof(*tcp_hdr));
+	if (!tcp_hdr)
+		return XDP_DROP;
+
+	tcp_hdr_opt_len = (tcp_hdr->doff * 4) - sizeof(struct tcphdr);
+	if (tcp_hdr_opt_len < tcp_hdr_opt_len_tpr)
+		return XDP_DROP;
+
+	hdr_bytes_remaining = tcp_hdr_opt_len;
+
+	off += sizeof(struct tcphdr);
+
+	/* max number of bytes of options in tcp header is 40 bytes */
+	for (int i = 0; i < tcp_hdr_opt_max_opt_checks; i++) {
+		err = parse_hdr_opt(&ptr, &off, &hdr_bytes_remaining, &server_id);
+
+		if (err || !hdr_bytes_remaining)
+			break;
+	}
+
+	if (!server_id)
+		return XDP_DROP;
+
+	return XDP_PASS;
+}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_dynptr.c b/tools/testing/selftests/bpf/progs/test_xdp_dynptr.c
new file mode 100644
index 000000000000..f9e1c36f9471
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_xdp_dynptr.c
@@ -0,0 +1,240 @@
+/* Copyright (c) 2022 Meta
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ */
+#include <stddef.h>
+#include <string.h>
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/in.h>
+#include <linux/udp.h>
+#include <linux/tcp.h>
+#include <linux/pkt_cls.h>
+#include <sys/socket.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+#include "test_iptunnel_common.h"
+
+const size_t tcphdr_sz = sizeof(struct tcphdr);
+const size_t udphdr_sz = sizeof(struct udphdr);
+const size_t ethhdr_sz = sizeof(struct ethhdr);
+const size_t iphdr_sz = sizeof(struct iphdr);
+const size_t ipv6hdr_sz = sizeof(struct ipv6hdr);
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__uint(max_entries, 256);
+	__type(key, __u32);
+	__type(value, __u64);
+} rxcnt SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_IPTNL_ENTRIES);
+	__type(key, struct vip);
+	__type(value, struct iptnl_info);
+} vip2tnl SEC(".maps");
+
+static __always_inline void count_tx(__u32 protocol)
+{
+	__u64 *rxcnt_count;
+
+	rxcnt_count = bpf_map_lookup_elem(&rxcnt, &protocol);
+	if (rxcnt_count)
+		*rxcnt_count += 1;
+}
+
+static __always_inline int get_dport(void *trans_data, __u8 protocol)
+{
+	struct tcphdr *th;
+	struct udphdr *uh;
+
+	switch (protocol) {
+	case IPPROTO_TCP:
+		th = (struct tcphdr *)trans_data;
+		return th->dest;
+	case IPPROTO_UDP:
+		uh = (struct udphdr *)trans_data;
+		return uh->dest;
+	default:
+		return 0;
+	}
+}
+
+static __always_inline void set_ethhdr(struct ethhdr *new_eth,
+				       const struct ethhdr *old_eth,
+				       const struct iptnl_info *tnl,
+				       __be16 h_proto)
+{
+	memcpy(new_eth->h_source, old_eth->h_dest, sizeof(new_eth->h_source));
+	memcpy(new_eth->h_dest, tnl->dmac, sizeof(new_eth->h_dest));
+	new_eth->h_proto = h_proto;
+}
+
+static __always_inline int handle_ipv4(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
+{
+	struct bpf_dynptr new_xdp_ptr;
+	struct iptnl_info *tnl;
+	struct ethhdr *new_eth;
+	struct ethhdr *old_eth;
+	__u32 transport_hdr_sz;
+	struct iphdr *iph;
+	__u16 *next_iph;
+	__u16 payload_len;
+	struct vip vip = {};
+	int dport;
+	__u32 csum = 0;
+	int i;
+
+	if (ethhdr_sz + iphdr_sz + tcphdr_sz > xdp->data_end - xdp->data)
+		transport_hdr_sz = udphdr_sz;
+	else
+		transport_hdr_sz = tcphdr_sz;
+
+	iph = bpf_dynptr_data(xdp_ptr, ethhdr_sz, iphdr_sz + transport_hdr_sz);
+	if (!iph)
+		return XDP_DROP;
+
+	dport = get_dport(iph + 1, iph->protocol);
+	if (dport == -1)
+		return XDP_DROP;
+
+	vip.protocol = iph->protocol;
+	vip.family = AF_INET;
+	vip.daddr.v4 = iph->daddr;
+	vip.dport = dport;
+	payload_len = bpf_ntohs(iph->tot_len);
+
+	tnl = bpf_map_lookup_elem(&vip2tnl, &vip);
+	/* It only does v4-in-v4 */
+	if (!tnl || tnl->family != AF_INET)
+		return XDP_PASS;
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)iphdr_sz))
+		return XDP_DROP;
+
+	bpf_dynptr_from_xdp(xdp, 0, &new_xdp_ptr);
+	new_eth = bpf_dynptr_data(&new_xdp_ptr, 0, ethhdr_sz + iphdr_sz + ethhdr_sz);
+	if (!new_eth)
+		return XDP_DROP;
+
+	iph = (struct iphdr *)(new_eth + 1);
+	old_eth = (struct ethhdr *)(iph + 1);
+
+	set_ethhdr(new_eth, old_eth, tnl, bpf_htons(ETH_P_IP));
+
+	iph->version = 4;
+	iph->ihl = iphdr_sz >> 2;
+	iph->frag_off =	0;
+	iph->protocol = IPPROTO_IPIP;
+	iph->check = 0;
+	iph->tos = 0;
+	iph->tot_len = bpf_htons(payload_len + iphdr_sz);
+	iph->daddr = tnl->daddr.v4;
+	iph->saddr = tnl->saddr.v4;
+	iph->ttl = 8;
+
+	next_iph = (__u16 *)iph;
+#pragma clang loop unroll(full)
+	for (i = 0; i < iphdr_sz >> 1; i++)
+		csum += *next_iph++;
+
+	iph->check = ~((csum & 0xffff) + (csum >> 16));
+
+	count_tx(vip.protocol);
+
+	return XDP_TX;
+}
+
+static __always_inline int handle_ipv6(struct xdp_md *xdp, struct bpf_dynptr *xdp_ptr)
+{
+	struct bpf_dynptr new_xdp_ptr;
+	struct iptnl_info *tnl;
+	struct ethhdr *new_eth;
+	struct ethhdr *old_eth;
+	__u32 transport_hdr_sz;
+	struct ipv6hdr *ip6h;
+	__u16 payload_len;
+	struct vip vip = {};
+	int dport;
+
+	if (ethhdr_sz + iphdr_sz + tcphdr_sz > xdp->data_end - xdp->data)
+		transport_hdr_sz = udphdr_sz;
+	else
+		transport_hdr_sz = tcphdr_sz;
+
+	ip6h = bpf_dynptr_data(xdp_ptr, ethhdr_sz, ipv6hdr_sz + transport_hdr_sz);
+	if (!ip6h)
+		return XDP_DROP;
+
+	dport = get_dport(ip6h + 1, ip6h->nexthdr);
+	if (dport == -1)
+		return XDP_DROP;
+
+	vip.protocol = ip6h->nexthdr;
+	vip.family = AF_INET6;
+	memcpy(vip.daddr.v6, ip6h->daddr.s6_addr32, sizeof(vip.daddr));
+	vip.dport = dport;
+	payload_len = ip6h->payload_len;
+
+	tnl = bpf_map_lookup_elem(&vip2tnl, &vip);
+	/* It only does v6-in-v6 */
+	if (!tnl || tnl->family != AF_INET6)
+		return XDP_PASS;
+
+	if (bpf_xdp_adjust_head(xdp, 0 - (int)ipv6hdr_sz))
+		return XDP_DROP;
+
+	bpf_dynptr_from_xdp(xdp, 0, &new_xdp_ptr);
+	new_eth = bpf_dynptr_data(&new_xdp_ptr, 0, ethhdr_sz + ipv6hdr_sz + ethhdr_sz);
+	if (!new_eth)
+		return XDP_DROP;
+
+	ip6h = (struct ipv6hdr *)(new_eth + 1);
+	old_eth = (struct ethhdr *)(ip6h + 1);
+
+	set_ethhdr(new_eth, old_eth, tnl, bpf_htons(ETH_P_IPV6));
+
+	ip6h->version = 6;
+	ip6h->priority = 0;
+	memset(ip6h->flow_lbl, 0, sizeof(ip6h->flow_lbl));
+	ip6h->payload_len = bpf_htons(bpf_ntohs(payload_len) + ipv6hdr_sz);
+	ip6h->nexthdr = IPPROTO_IPV6;
+	ip6h->hop_limit = 8;
+	memcpy(ip6h->saddr.s6_addr32, tnl->saddr.v6, sizeof(tnl->saddr.v6));
+	memcpy(ip6h->daddr.s6_addr32, tnl->daddr.v6, sizeof(tnl->daddr.v6));
+
+	count_tx(vip.protocol);
+
+	return XDP_TX;
+}
+
+SEC("xdp")
+int _xdp_tx_iptunnel(struct xdp_md *xdp)
+{
+	struct bpf_dynptr ptr;
+	struct ethhdr *eth;
+	__u16 h_proto;
+
+	bpf_dynptr_from_xdp(xdp, 0, &ptr);
+	eth = bpf_dynptr_data(&ptr, 0, ethhdr_sz);
+	if (!eth)
+		return XDP_DROP;
+
+	h_proto = eth->h_proto;
+
+	if (h_proto == bpf_htons(ETH_P_IP))
+		return handle_ipv4(xdp, &ptr);
+	else if (h_proto == bpf_htons(ETH_P_IPV6))
+
+		return handle_ipv6(xdp, &ptr);
+	else
+		return XDP_DROP;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
index 6118e3ab61fc..56c9f8a3ad3d 100644
--- a/tools/testing/selftests/bpf/test_tcp_hdr_options.h
+++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h
@@ -50,6 +50,7 @@ struct linum_err {
 
 #define TCPOPT_EOL		0
 #define TCPOPT_NOP		1
+#define TCPOPT_MSS		2
 #define TCPOPT_WINDOW		3
 #define TCPOPT_EXP		254
 
-- 
2.30.2


^ permalink raw reply related

* [PATCH bpf-next v2 2/3] bpf: Add xdp dynptrs
From: Joanne Koong @ 2022-08-11 23:05 UTC (permalink / raw)
  To: bpf; +Cc: kafai, kuba, andrii, daniel, ast, netdev, kernel-team,
	Joanne Koong
In-Reply-To: <20220811230501.2632393-1-joannelkoong@gmail.com>

Add xdp dynptrs, which are dynptrs whose underlying pointer points
to a xdp_buff. The dynptr acts on xdp data. xdp dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of xdp->data and xdp->data_end) can be more
ergonomic and less brittle (eg does not need manual if checking for
being within bounds of data_end).

For reads and writes on the dynptr, this includes reading/writing
from/to and across fragments. For data slices, direct access to
data in fragments is also permitted, but access across fragments
is not. The returned data slice is reg type PTR_TO_PACKET | PTR_MAYBE_NULL.

Any helper calls that change the underlying packet buffer (eg
bpf_xdp_adjust_head) invalidates any data slices of the associated
dynptr. Whenever such a helper call is made, the verifier marks any
PTR_TO_PACKET reg type (which includes xdp dynptr slices since they are
PTR_TO_PACKETs) as unknown. The stack trace for this is
check_helper_call() -> clear_all_pkt_pointers() ->
__clear_all_pkt_pointers() -> mark_reg_unknown()

For examples of how xdp dynptrs can be used, please see the attached
selftests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/linux/bpf.h            |  8 +++++-
 include/linux/filter.h         |  3 +++
 include/uapi/linux/bpf.h       | 25 +++++++++++++++---
 kernel/bpf/helpers.c           | 14 ++++++++++-
 kernel/bpf/verifier.c          |  8 +++++-
 net/core/filter.c              | 46 +++++++++++++++++++++++++++++-----
 tools/include/uapi/linux/bpf.h | 25 +++++++++++++++---
 7 files changed, 114 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index be9fb563c318..bef7af58588e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -410,11 +410,15 @@ enum bpf_type_flag {
 	/* DYNPTR points to sk_buff */
 	DYNPTR_TYPE_SKB		= BIT(11 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to xdp_buff */
+	DYNPTR_TYPE_XDP		= BIT(12 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
-#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
+#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB \
+				 | DYNPTR_TYPE_XDP)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
@@ -2562,6 +2566,8 @@ enum bpf_dynptr_type {
 	BPF_DYNPTR_TYPE_RINGBUF,
 	/* Underlying data is a sk_buff */
 	BPF_DYNPTR_TYPE_SKB,
+	/* Underlying data is a xdp_buff */
+	BPF_DYNPTR_TYPE_XDP,
 };
 
 void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 649063d9cbfd..80f030239877 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1535,5 +1535,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
 int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
 			  u32 len, u64 flags);
+int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
+int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len);
+void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len);
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a6bca9619d41..681f75c2afdd 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5281,13 +5281,18 @@ union bpf_attr {
  *		      and try again.
  *
  *		    * The data slice is automatically invalidated anytime
- *		      **bpf_dynptr_write**\ () or a helper call that changes
- *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      **bpf_dynptr_write**\ () is called.
+ *
+ *		For skb-type and xdp-type dynptrs:
+ *		    * The data slice is automatically invalidated anytime a
+ *		      helper call that changes the underlying packet buffer
+ *		      (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ())
  *		      is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds or in a paged buffer for skb-type dynptrs.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs or
+ *		across fragments for xdp-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5386,6 +5391,19 @@ union bpf_attr {
  *		*flags* is currently unused, it must be 0 for now.
  *	Return
  *		0 on success or -EINVAL if flags is not 0.
+ *
+ * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
+ *		context.
+ *
+ *		Calls that change the *xdp_md*'s underlying packet buffer
+ *		(eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but
+ *		they do invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused, it must be 0 for now.
+ *	Return
+ *		0 on success, -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5598,6 +5616,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
 	FN(dynptr_from_skb),		\
+	FN(dynptr_from_xdp),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 471a01a9b6ae..2b9dc4c6de04 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1541,6 +1541,8 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
 		return 0;
 	case BPF_DYNPTR_TYPE_SKB:
 		return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
+	case BPF_DYNPTR_TYPE_XDP:
+		return __bpf_xdp_load_bytes(src->data, src->offset + offset, dst, len);
 	default:
 		WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
 		return -EFAULT;
@@ -1583,6 +1585,10 @@ BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *,
 	case BPF_DYNPTR_TYPE_SKB:
 		return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
 					     flags);
+	case BPF_DYNPTR_TYPE_XDP:
+		if (flags)
+			return -EINVAL;
+		return __bpf_xdp_store_bytes(dst->data, dst->offset + offset, src, len);
 	default:
 		WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
 		return -EFAULT;
@@ -1616,7 +1622,7 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
 
 	type = bpf_dynptr_get_type(ptr);
 
-	/* Only skb dynptrs can get read-only data slices, because the
+	/* Only skb and xdp dynptrs can get read-only data slices, because the
 	 * verifier enforces PTR_TO_PACKET accesses
 	 */
 	is_rdonly = bpf_dynptr_is_rdonly(ptr);
@@ -1640,6 +1646,12 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
 		data = skb->data;
 		break;
 	}
+	case BPF_DYNPTR_TYPE_XDP:
+		/* if the requested data in across fragments, then it cannot
+		 * be accessed directly - bpf_xdp_pointer will return NULL
+		 */
+		return (unsigned long)bpf_xdp_pointer(ptr->data,
+						      ptr->offset + offset, len);
 	default:
 		WARN(true, "bpf_dynptr_data: unknown dynptr type %d\n", type);
 		return 0;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1ea295f47525..d33648eee188 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -686,6 +686,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_RINGBUF;
 	case DYNPTR_TYPE_SKB:
 		return BPF_DYNPTR_TYPE_SKB;
+	case DYNPTR_TYPE_XDP:
+		return BPF_DYNPTR_TYPE_XDP;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -6078,6 +6080,9 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			case DYNPTR_TYPE_SKB:
 				err_extra = "skb ";
 				break;
+			case DYNPTR_TYPE_XDP:
+				err_extra = "xdp ";
+				break;
 			}
 
 			verbose(env, "Expected an initialized %sdynptr as arg #%d\n",
@@ -7439,7 +7444,8 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		mark_reg_known_zero(env, regs, BPF_REG_0);
 
 		if (func_id == BPF_FUNC_dynptr_data &&
-		    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
+		    (dynptr_type == BPF_DYNPTR_TYPE_SKB ||
+		     dynptr_type == BPF_DYNPTR_TYPE_XDP)) {
 			regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
 			regs[BPF_REG_0].range = meta.mem_size;
 		} else {
diff --git a/net/core/filter.c b/net/core/filter.c
index 312f99deb759..3c8ba88eabb4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3825,7 +3825,29 @@ static const struct bpf_func_proto sk_skb_change_head_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_1(bpf_xdp_get_buff_len, struct  xdp_buff*, xdp)
+BPF_CALL_3(bpf_dynptr_from_xdp, struct xdp_buff*, xdp, u64, flags,
+	   struct bpf_dynptr_kern *, ptr)
+{
+	if (flags) {
+		bpf_dynptr_set_null(ptr);
+		return -EINVAL;
+	}
+
+	bpf_dynptr_init(ptr, xdp, BPF_DYNPTR_TYPE_XDP, 0, xdp_get_buff_len(xdp));
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_dynptr_from_xdp_proto = {
+	.func		= bpf_dynptr_from_xdp,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_XDP | MEM_UNINIT,
+};
+
+BPF_CALL_1(bpf_xdp_get_buff_len, struct xdp_buff*, xdp)
 {
 	return xdp_get_buff_len(xdp);
 }
@@ -3927,7 +3949,7 @@ static void bpf_xdp_copy_buf(struct xdp_buff *xdp, unsigned long off,
 	}
 }
 
-static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
+void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
 	u32 size = xdp->data_end - xdp->data;
@@ -3958,8 +3980,7 @@ static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
 	return offset + len <= size ? addr + offset : NULL;
 }
 
-BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
-	   void *, buf, u32, len)
+int __bpf_xdp_load_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len)
 {
 	void *ptr;
 
@@ -3975,6 +3996,12 @@ BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
 	return 0;
 }
 
+BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
+	   void *, buf, u32, len)
+{
+	return __bpf_xdp_load_bytes(xdp, offset, buf, len);
+}
+
 static const struct bpf_func_proto bpf_xdp_load_bytes_proto = {
 	.func		= bpf_xdp_load_bytes,
 	.gpl_only	= false,
@@ -3985,8 +4012,7 @@ static const struct bpf_func_proto bpf_xdp_load_bytes_proto = {
 	.arg4_type	= ARG_CONST_SIZE,
 };
 
-BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
-	   void *, buf, u32, len)
+int __bpf_xdp_store_bytes(struct xdp_buff *xdp, u32 offset, void *buf, u32 len)
 {
 	void *ptr;
 
@@ -4002,6 +4028,12 @@ BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
 	return 0;
 }
 
+BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
+	   void *, buf, u32, len)
+{
+	return __bpf_xdp_store_bytes(xdp, offset, buf, len);
+}
+
 static const struct bpf_func_proto bpf_xdp_store_bytes_proto = {
 	.func		= bpf_xdp_store_bytes,
 	.gpl_only	= false,
@@ -8091,6 +8123,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_raw_check_syncookie_ipv6_proto;
 #endif
 #endif
+	case BPF_FUNC_dynptr_from_xdp:
+		return &bpf_dynptr_from_xdp_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a66b2b4df4fd..017a0cc7ebd1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5281,13 +5281,18 @@ union bpf_attr {
  *		      and try again.
  *
  *		    * The data slice is automatically invalidated anytime
- *		      **bpf_dynptr_write**\ () or a helper call that changes
- *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      **bpf_dynptr_write**\ () is called.
+ *
+ *		For skb-type and xdp-type dynptrs:
+ *		    * The data slice is automatically invalidated anytime a
+ *		      helper call that changes the underlying packet buffer
+ *		      (eg **bpf_skb_pull_data**\ (), **bpf_xdp_adjust_head**\ ())
  *		      is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds or in a paged buffer for skb-type dynptrs.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs or
+ *		across fragments for xdp-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5386,6 +5391,19 @@ union bpf_attr {
  *		*flags* is currently unused, it must be 0 for now.
  *	Return
  *		0 on success or -EINVAL if flags is not 0.
+ *
+ * long bpf_dynptr_from_xdp(struct xdp_buff *xdp_md, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *xdp_md*. *xdp_md* must be the BPF program
+ *		context.
+ *
+ *		Calls that change the *xdp_md*'s underlying packet buffer
+ *		(eg **bpf_xdp_adjust_head**\ ()) do not invalidate the dynptr, but
+ *		they do invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused, it must be 0 for now.
+ *	Return
+ *		0 on success, -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5598,6 +5616,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
 	FN(dynptr_from_skb),		\
+	FN(dynptr_from_xdp),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2


^ permalink raw reply related

* [PATCH bpf-next v2 1/3] bpf: Add skb dynptrs
From: Joanne Koong @ 2022-08-11 23:04 UTC (permalink / raw)
  To: bpf; +Cc: kafai, kuba, andrii, daniel, ast, netdev, kernel-team,
	Joanne Koong
In-Reply-To: <20220811230501.2632393-1-joannelkoong@gmail.com>

Add skb dynptrs, which are dynptrs whose underlying pointer points
to a skb. The dynptr acts on skb data. skb dynptrs have two main
benefits. One is that they allow operations on sizes that are not
statically known at compile-time (eg variable-sized accesses).
Another is that parsing the packet data through dynptrs (instead of
through direct access of skb->data and skb->data_end) can be more
ergonomic and less brittle (eg does not need manual if checking for
being within bounds of data_end).

For bpf prog types that don't support writes on skb data, the dynptr is
read-only. For reads and writes through the bpf_dynptr_read() and
bpf_dynptr_write() interfaces, this supports reading and writing into
data in the non-linear paged buffers. For data slices (through the
bpf_dynptr_data() interface), if the data is in a paged buffer, the user
must first call bpf_skb_pull_data() to pull the data into the linear
portion. The returned data slice from a call to bpf_dynptr_data() is of
reg type PTR_TO_PACKET | PTR_MAYBE_NULL.

Any bpf_dynptr_write() automatically invalidates any prior data slices
to the skb dynptr. This is because a bpf_dynptr_write() may be writing
to data in a paged buffer, so it will need to pull the buffer first into
the head. The reason it needs to be pulled instead of writing directly to
the paged buffers is because they may be cloned (only the head of the skb
is by default uncloned). As such, any bpf_dynptr_write() will
automatically have its prior data slices invalidated, even if the write
is to data in the skb head (the verifier has no way of differentiating
whether the write is to the head or paged buffers during program load
time). Please note as well that any other helper calls that change the
underlying packet buffer (eg bpf_skb_pull_data()) invalidates any data
slices of the skb dynptr as well. Whenever such a helper call is made,
the verifier marks any PTR_TO_PACKET reg type (which includes skb dynptr
slices since they are PTR_TO_PACKETs) as unknown. The stack trace for
this is check_helper_call() -> clear_all_pkt_pointers() ->
__clear_all_pkt_pointers() -> mark_reg_unknown()

For examples of how skb dynptrs can be used, please see the attached
selftests.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
---
 include/linux/bpf.h            |  8 ++-
 include/linux/filter.h         |  4 ++
 include/uapi/linux/bpf.h       | 40 ++++++++++++--
 kernel/bpf/helpers.c           | 81 +++++++++++++++++++++++++---
 kernel/bpf/verifier.c          | 99 ++++++++++++++++++++++++++++------
 net/core/filter.c              | 53 ++++++++++++++++--
 tools/include/uapi/linux/bpf.h | 40 ++++++++++++--
 7 files changed, 289 insertions(+), 36 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a627a02cf8ab..be9fb563c318 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -407,11 +407,14 @@ enum bpf_type_flag {
 	/* Size is known at compile time. */
 	MEM_FIXED_SIZE		= BIT(10 + BPF_BASE_TYPE_BITS),
 
+	/* DYNPTR points to sk_buff */
+	DYNPTR_TYPE_SKB		= BIT(11 + BPF_BASE_TYPE_BITS),
+
 	__BPF_TYPE_FLAG_MAX,
 	__BPF_TYPE_LAST_FLAG	= __BPF_TYPE_FLAG_MAX - 1,
 };
 
-#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF)
+#define DYNPTR_TYPE_FLAG_MASK	(DYNPTR_TYPE_LOCAL | DYNPTR_TYPE_RINGBUF | DYNPTR_TYPE_SKB)
 
 /* Max number of base types. */
 #define BPF_BASE_TYPE_LIMIT	(1UL << BPF_BASE_TYPE_BITS)
@@ -2557,12 +2560,15 @@ enum bpf_dynptr_type {
 	BPF_DYNPTR_TYPE_LOCAL,
 	/* Underlying data is a ringbuf record */
 	BPF_DYNPTR_TYPE_RINGBUF,
+	/* Underlying data is a sk_buff */
+	BPF_DYNPTR_TYPE_SKB,
 };
 
 void bpf_dynptr_init(struct bpf_dynptr_kern *ptr, void *data,
 		     enum bpf_dynptr_type type, u32 offset, u32 size);
 void bpf_dynptr_set_null(struct bpf_dynptr_kern *ptr);
 int bpf_dynptr_check_size(u32 size);
+void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr);
 
 #ifdef CONFIG_BPF_LSM
 void bpf_cgroup_atype_get(u32 attach_btf_id, int cgroup_atype);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index a5f21dc3c432..649063d9cbfd 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1532,4 +1532,8 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 	return XDP_REDIRECT;
 }
 
+int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len);
+int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
+			  u32 len, u64 flags);
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7d1e2794d83e..a6bca9619d41 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5251,11 +5251,22 @@ union bpf_attr {
  *	Description
  *		Write *len* bytes from *src* into *dst*, starting from *offset*
  *		into *dst*.
- *		*flags* is currently unused.
+ *
+ *		*flags* must be 0 except for skb-type dynptrs.
+ *
+ *		For skb-type dynptrs:
+ *		    *  All data slices of the dynptr are automatically
+ *		       invalidated after **bpf_dynptr_write**\ (). If you wish to
+ *		       avoid this, please perform the write using direct data slices
+ *		       instead.
+ *
+ *		    *  For *flags*, please see the flags accepted by
+ *		       **bpf_skb_store_bytes**\ ().
  *	Return
  *		0 on success, -E2BIG if *offset* + *len* exceeds the length
  *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
- *		is a read-only dynptr or if *flags* is not 0.
+ *		is a read-only dynptr or if *flags* is not correct. For skb-type dynptrs,
+ *		other errors correspond to errors returned by **bpf_skb_store_bytes**\ ().
  *
  * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
  *	Description
@@ -5263,10 +5274,20 @@ union bpf_attr {
  *
  *		*len* must be a statically known value. The returned data slice
  *		is invalidated whenever the dynptr is invalidated.
+ *
+ *		For skb-type dynptrs:
+ *		    * If *offset* + *len* extends into the skb's paged buffers,
+ *		      the user should manually pull the skb with **bpf_skb_pull_data**\ ()
+ *		      and try again.
+ *
+ *		    * The data slice is automatically invalidated anytime
+ *		      **bpf_dynptr_write**\ () or a helper call that changes
+ *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5353,6 +5374,18 @@ union bpf_attr {
  *	Return
  *		Current *ktime*.
  *
+ * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *skb*. *skb* must be the BPF program
+ *		context. Depending on program type, the dynptr may be read-only.
+ *
+ *		Calls that change the *skb*'s underlying packet buffer
+ *		(eg **bpf_skb_pull_data**\ ()) do not invalidate the dynptr, but
+ *		they do invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused, it must be 0 for now.
+ *	Return
+ *		0 on success or -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5564,6 +5597,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
+	FN(dynptr_from_skb),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 3c1b9bbcf971..471a01a9b6ae 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -1437,11 +1437,21 @@ static bool bpf_dynptr_is_rdonly(struct bpf_dynptr_kern *ptr)
 	return ptr->size & DYNPTR_RDONLY_BIT;
 }
 
+void bpf_dynptr_set_rdonly(struct bpf_dynptr_kern *ptr)
+{
+	ptr->size |= DYNPTR_RDONLY_BIT;
+}
+
 static void bpf_dynptr_set_type(struct bpf_dynptr_kern *ptr, enum bpf_dynptr_type type)
 {
 	ptr->size |= type << DYNPTR_TYPE_SHIFT;
 }
 
+static enum bpf_dynptr_type bpf_dynptr_get_type(const struct bpf_dynptr_kern *ptr)
+{
+	return (ptr->size & ~(DYNPTR_RDONLY_BIT)) >> DYNPTR_TYPE_SHIFT;
+}
+
 static u32 bpf_dynptr_get_size(struct bpf_dynptr_kern *ptr)
 {
 	return ptr->size & DYNPTR_SIZE_MASK;
@@ -1512,6 +1522,7 @@ static const struct bpf_func_proto bpf_dynptr_from_mem_proto = {
 BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src,
 	   u32, offset, u64, flags)
 {
+	enum bpf_dynptr_type type;
 	int err;
 
 	if (!src->data || flags)
@@ -1521,9 +1532,19 @@ BPF_CALL_5(bpf_dynptr_read, void *, dst, u32, len, struct bpf_dynptr_kern *, src
 	if (err)
 		return err;
 
-	memcpy(dst, src->data + src->offset + offset, len);
+	type = bpf_dynptr_get_type(src);
 
-	return 0;
+	switch (type) {
+	case BPF_DYNPTR_TYPE_LOCAL:
+	case BPF_DYNPTR_TYPE_RINGBUF:
+		memcpy(dst, src->data + src->offset + offset, len);
+		return 0;
+	case BPF_DYNPTR_TYPE_SKB:
+		return __bpf_skb_load_bytes(src->data, src->offset + offset, dst, len);
+	default:
+		WARN(true, "bpf_dynptr_read: unknown dynptr type %d\n", type);
+		return -EFAULT;
+	}
 }
 
 static const struct bpf_func_proto bpf_dynptr_read_proto = {
@@ -1540,18 +1561,32 @@ static const struct bpf_func_proto bpf_dynptr_read_proto = {
 BPF_CALL_5(bpf_dynptr_write, struct bpf_dynptr_kern *, dst, u32, offset, void *, src,
 	   u32, len, u64, flags)
 {
+	enum bpf_dynptr_type type;
 	int err;
 
-	if (!dst->data || flags || bpf_dynptr_is_rdonly(dst))
+	if (!dst->data || bpf_dynptr_is_rdonly(dst))
 		return -EINVAL;
 
 	err = bpf_dynptr_check_off_len(dst, offset, len);
 	if (err)
 		return err;
 
-	memcpy(dst->data + dst->offset + offset, src, len);
+	type = bpf_dynptr_get_type(dst);
 
-	return 0;
+	switch (type) {
+	case BPF_DYNPTR_TYPE_LOCAL:
+	case BPF_DYNPTR_TYPE_RINGBUF:
+		if (flags)
+			return -EINVAL;
+		memcpy(dst->data + dst->offset + offset, src, len);
+		return 0;
+	case BPF_DYNPTR_TYPE_SKB:
+		return __bpf_skb_store_bytes(dst->data, dst->offset + offset, src, len,
+					     flags);
+	default:
+		WARN(true, "bpf_dynptr_write: unknown dynptr type %d\n", type);
+		return -EFAULT;
+	}
 }
 
 static const struct bpf_func_proto bpf_dynptr_write_proto = {
@@ -1567,6 +1602,9 @@ static const struct bpf_func_proto bpf_dynptr_write_proto = {
 
 BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len)
 {
+	enum bpf_dynptr_type type;
+	bool is_rdonly;
+	void *data;
 	int err;
 
 	if (!ptr->data)
@@ -1576,10 +1614,37 @@ BPF_CALL_3(bpf_dynptr_data, struct bpf_dynptr_kern *, ptr, u32, offset, u32, len
 	if (err)
 		return 0;
 
-	if (bpf_dynptr_is_rdonly(ptr))
-		return 0;
+	type = bpf_dynptr_get_type(ptr);
+
+	/* Only skb dynptrs can get read-only data slices, because the
+	 * verifier enforces PTR_TO_PACKET accesses
+	 */
+	is_rdonly = bpf_dynptr_is_rdonly(ptr);
+
+	switch (type) {
+	case BPF_DYNPTR_TYPE_LOCAL:
+	case BPF_DYNPTR_TYPE_RINGBUF:
+		if (is_rdonly)
+			return 0;
+
+		data = ptr->data;
+		break;
+	case BPF_DYNPTR_TYPE_SKB:
+	{
+		struct sk_buff *skb = ptr->data;
 
-	return (unsigned long)(ptr->data + ptr->offset + offset);
+		/* if the data is paged, the caller needs to pull it first */
+		if (ptr->offset + offset + len > skb->len - skb->data_len)
+			return 0;
+
+		data = skb->data;
+		break;
+	}
+	default:
+		WARN(true, "bpf_dynptr_data: unknown dynptr type %d\n", type);
+		return 0;
+	}
+	return (unsigned long)(data + ptr->offset + offset);
 }
 
 static const struct bpf_func_proto bpf_dynptr_data_proto = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 2c1f8069f7b7..1ea295f47525 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -684,6 +684,8 @@ static enum bpf_dynptr_type arg_to_dynptr_type(enum bpf_arg_type arg_type)
 		return BPF_DYNPTR_TYPE_LOCAL;
 	case DYNPTR_TYPE_RINGBUF:
 		return BPF_DYNPTR_TYPE_RINGBUF;
+	case DYNPTR_TYPE_SKB:
+		return BPF_DYNPTR_TYPE_SKB;
 	default:
 		return BPF_DYNPTR_TYPE_INVALID;
 	}
@@ -5826,12 +5828,29 @@ int check_func_arg_reg_off(struct bpf_verifier_env *env,
 	return __check_ptr_off_reg(env, reg, regno, fixed_off_ok);
 }
 
-static u32 stack_slot_get_id(struct bpf_verifier_env *env, struct bpf_reg_state *reg)
+static struct bpf_reg_state *get_dynptr_arg_reg(const struct bpf_func_proto *fn,
+						struct bpf_reg_state *regs)
+{
+	int i;
+
+	for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++)
+		if (arg_type_is_dynptr(fn->arg_type[i]))
+			return &regs[BPF_REG_1 + i];
+
+	return NULL;
+}
+
+static enum bpf_dynptr_type stack_slot_get_dynptr_info(struct bpf_verifier_env *env,
+						       struct bpf_reg_state *reg,
+						       int *ref_obj_id)
 {
 	struct bpf_func_state *state = func(env, reg);
 	int spi = get_spi(reg->off);
 
-	return state->stack[spi].spilled_ptr.id;
+	if (ref_obj_id)
+		*ref_obj_id = state->stack[spi].spilled_ptr.id;
+
+	return state->stack[spi].spilled_ptr.dynptr.type;
 }
 
 static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
@@ -6056,7 +6075,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg,
 			case DYNPTR_TYPE_RINGBUF:
 				err_extra = "ringbuf ";
 				break;
-			default:
+			case DYNPTR_TYPE_SKB:
+				err_extra = "skb ";
 				break;
 			}
 
@@ -7149,6 +7169,7 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 {
 	enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
 	const struct bpf_func_proto *fn = NULL;
+	enum bpf_dynptr_type dynptr_type;
 	enum bpf_return_type ret_type;
 	enum bpf_type_flag ret_flag;
 	struct bpf_reg_state *regs;
@@ -7320,24 +7341,43 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 			}
 		}
 		break;
-	case BPF_FUNC_dynptr_data:
-		for (i = 0; i < MAX_BPF_FUNC_REG_ARGS; i++) {
-			if (arg_type_is_dynptr(fn->arg_type[i])) {
-				if (meta.ref_obj_id) {
-					verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
-					return -EFAULT;
-				}
-				/* Find the id of the dynptr we're tracking the reference of */
-				meta.ref_obj_id = stack_slot_get_id(env, &regs[BPF_REG_1 + i]);
-				break;
-			}
+	case BPF_FUNC_dynptr_write:
+	{
+		struct bpf_reg_state *reg;
+
+		reg = get_dynptr_arg_reg(fn, regs);
+		if (!reg) {
+			verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");
+			return -EFAULT;
 		}
-		if (i == MAX_BPF_FUNC_REG_ARGS) {
+
+		/* bpf_dynptr_write() for skb-type dynptrs may pull the skb, so we must
+		 * invalidate all data slices associated with it
+		 */
+		if (stack_slot_get_dynptr_info(env, reg, NULL) == BPF_DYNPTR_TYPE_SKB)
+			changes_data = true;
+
+		break;
+	}
+	case BPF_FUNC_dynptr_data:
+	{
+		struct bpf_reg_state *reg;
+
+		reg = get_dynptr_arg_reg(fn, regs);
+		if (!reg) {
 			verbose(env, "verifier internal error: no dynptr in bpf_dynptr_data()\n");
 			return -EFAULT;
 		}
+
+		if (meta.ref_obj_id) {
+			verbose(env, "verifier internal error: meta.ref_obj_id already set\n");
+			return -EFAULT;
+		}
+
+		dynptr_type = stack_slot_get_dynptr_info(env, reg, &meta.ref_obj_id);
 		break;
 	}
+	}
 
 	if (err)
 		return err;
@@ -7397,8 +7437,15 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		break;
 	case RET_PTR_TO_ALLOC_MEM:
 		mark_reg_known_zero(env, regs, BPF_REG_0);
-		regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
-		regs[BPF_REG_0].mem_size = meta.mem_size;
+
+		if (func_id == BPF_FUNC_dynptr_data &&
+		    dynptr_type == BPF_DYNPTR_TYPE_SKB) {
+			regs[BPF_REG_0].type = PTR_TO_PACKET | ret_flag;
+			regs[BPF_REG_0].range = meta.mem_size;
+		} else {
+			regs[BPF_REG_0].type = PTR_TO_MEM | ret_flag;
+			regs[BPF_REG_0].mem_size = meta.mem_size;
+		}
 		break;
 	case RET_PTR_TO_MEM_OR_BTF_ID:
 	{
@@ -14141,6 +14188,24 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto patch_call_imm;
 		}
 
+		if (insn->imm == BPF_FUNC_dynptr_from_skb) {
+			bool is_rdonly = !may_access_direct_pkt_data(env, NULL, BPF_WRITE);
+
+			insn_buf[0] = BPF_MOV32_IMM(BPF_REG_4, is_rdonly);
+			insn_buf[1] = *insn;
+			cnt = 2;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta += cnt - 1;
+			env->prog = new_prog;
+			prog = new_prog;
+			insn = new_prog->insnsi + i + delta;
+			goto patch_call_imm;
+		}
+
 		/* BPF_EMIT_CALL() assumptions in some of the map_gen_lookup
 		 * and other inlining handlers are currently limited to 64 bit
 		 * only.
diff --git a/net/core/filter.c b/net/core/filter.c
index 5669248aff25..312f99deb759 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1681,8 +1681,8 @@ static inline void bpf_pull_mac_rcsum(struct sk_buff *skb)
 		skb_postpull_rcsum(skb, skb_mac_header(skb), skb->mac_len);
 }
 
-BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
-	   const void *, from, u32, len, u64, flags)
+int __bpf_skb_store_bytes(struct sk_buff *skb, u32 offset, const void *from,
+			  u32 len, u64 flags)
 {
 	void *ptr;
 
@@ -1707,6 +1707,12 @@ BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
 	return 0;
 }
 
+BPF_CALL_5(bpf_skb_store_bytes, struct sk_buff *, skb, u32, offset,
+	   const void *, from, u32, len, u64, flags)
+{
+	return __bpf_skb_store_bytes(skb, offset, from, len, flags);
+}
+
 static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
 	.func		= bpf_skb_store_bytes,
 	.gpl_only	= false,
@@ -1718,8 +1724,7 @@ static const struct bpf_func_proto bpf_skb_store_bytes_proto = {
 	.arg5_type	= ARG_ANYTHING,
 };
 
-BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
-	   void *, to, u32, len)
+int __bpf_skb_load_bytes(const struct sk_buff *skb, u32 offset, void *to, u32 len)
 {
 	void *ptr;
 
@@ -1738,6 +1743,12 @@ BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
 	return -EFAULT;
 }
 
+BPF_CALL_4(bpf_skb_load_bytes, const struct sk_buff *, skb, u32, offset,
+	   void *, to, u32, len)
+{
+	return __bpf_skb_load_bytes(skb, offset, to, len);
+}
+
 static const struct bpf_func_proto bpf_skb_load_bytes_proto = {
 	.func		= bpf_skb_load_bytes,
 	.gpl_only	= false,
@@ -1849,6 +1860,32 @@ static const struct bpf_func_proto bpf_skb_pull_data_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+/* is_rdonly is set by the verifier */
+BPF_CALL_4(bpf_dynptr_from_skb, struct sk_buff *, skb, u64, flags,
+	   struct bpf_dynptr_kern *, ptr, u32, is_rdonly)
+{
+	if (flags) {
+		bpf_dynptr_set_null(ptr);
+		return -EINVAL;
+	}
+
+	bpf_dynptr_init(ptr, skb, BPF_DYNPTR_TYPE_SKB, 0, skb->len);
+
+	if (is_rdonly)
+		bpf_dynptr_set_rdonly(ptr);
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_dynptr_from_skb_proto = {
+	.func		= bpf_dynptr_from_skb,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_PTR_TO_DYNPTR | DYNPTR_TYPE_SKB | MEM_UNINIT,
+};
+
 BPF_CALL_1(bpf_sk_fullsock, struct sock *, sk)
 {
 	return sk_fullsock(sk) ? (unsigned long)sk : (unsigned long)NULL;
@@ -7808,6 +7845,8 @@ sk_filter_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_socket_uid_proto;
 	case BPF_FUNC_perf_event_output:
 		return &bpf_skb_event_output_proto;
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
@@ -7991,6 +8030,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_tcp_raw_check_syncookie_ipv6_proto;
 #endif
 #endif
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
@@ -8186,6 +8227,8 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skc_lookup_tcp:
 		return &bpf_skc_lookup_tcp_proto;
 #endif
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
@@ -8224,6 +8267,8 @@ lwt_out_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_get_smp_processor_id_proto;
 	case BPF_FUNC_skb_under_cgroup:
 		return &bpf_skb_under_cgroup_proto;
+	case BPF_FUNC_dynptr_from_skb:
+		return &bpf_dynptr_from_skb_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e174ad28aeb7..a66b2b4df4fd 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5251,11 +5251,22 @@ union bpf_attr {
  *	Description
  *		Write *len* bytes from *src* into *dst*, starting from *offset*
  *		into *dst*.
- *		*flags* is currently unused.
+ *
+ *		*flags* must be 0 except for skb-type dynptrs.
+ *
+ *		For skb-type dynptrs:
+ *		    *  All data slices of the dynptr are automatically
+ *		       invalidated after **bpf_dynptr_write**\ (). If you wish to
+ *		       avoid this, please perform the write using direct data slices
+ *		       instead.
+ *
+ *		    *  For *flags*, please see the flags accepted by
+ *		       **bpf_skb_store_bytes**\ ().
  *	Return
  *		0 on success, -E2BIG if *offset* + *len* exceeds the length
  *		of *dst*'s data, -EINVAL if *dst* is an invalid dynptr or if *dst*
- *		is a read-only dynptr or if *flags* is not 0.
+ *		is a read-only dynptr or if *flags* is not correct. For skb-type dynptrs,
+ *		other errors correspond to errors returned by **bpf_skb_store_bytes**\ ().
  *
  * void *bpf_dynptr_data(struct bpf_dynptr *ptr, u32 offset, u32 len)
  *	Description
@@ -5263,10 +5274,20 @@ union bpf_attr {
  *
  *		*len* must be a statically known value. The returned data slice
  *		is invalidated whenever the dynptr is invalidated.
+ *
+ *		For skb-type dynptrs:
+ *		    * If *offset* + *len* extends into the skb's paged buffers,
+ *		      the user should manually pull the skb with **bpf_skb_pull_data**\ ()
+ *		      and try again.
+ *
+ *		    * The data slice is automatically invalidated anytime
+ *		      **bpf_dynptr_write**\ () or a helper call that changes
+ *		      the underlying packet buffer (eg **bpf_skb_pull_data**\ ())
+ *		      is called.
  *	Return
  *		Pointer to the underlying dynptr data, NULL if the dynptr is
  *		read-only, if the dynptr is invalid, or if the offset and length
- *		is out of bounds.
+ *		is out of bounds or in a paged buffer for skb-type dynptrs.
  *
  * s64 bpf_tcp_raw_gen_syncookie_ipv4(struct iphdr *iph, struct tcphdr *th, u32 th_len)
  *	Description
@@ -5353,6 +5374,18 @@ union bpf_attr {
  *	Return
  *		Current *ktime*.
  *
+ * long bpf_dynptr_from_skb(struct sk_buff *skb, u64 flags, struct bpf_dynptr *ptr)
+ *	Description
+ *		Get a dynptr to the data in *skb*. *skb* must be the BPF program
+ *		context. Depending on program type, the dynptr may be read-only.
+ *
+ *		Calls that change the *skb*'s underlying packet buffer
+ *		(eg **bpf_skb_pull_data**\ ()) do not invalidate the dynptr, but
+ *		they do invalidate any data slices associated with the dynptr.
+ *
+ *		*flags* is currently unused, it must be 0 for now.
+ *	Return
+ *		0 on success or -EINVAL if flags is not 0.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5564,6 +5597,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(ktime_get_tai_ns),		\
+	FN(dynptr_from_skb),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.30.2


^ permalink raw reply related

* [PATCH bpf-next v2 0/3] Add skb + xdp dynptrs
From: Joanne Koong @ 2022-08-11 23:04 UTC (permalink / raw)
  To: bpf; +Cc: kafai, kuba, andrii, daniel, ast, netdev, kernel-team,
	Joanne Koong

This patchset is the 2nd in the dynptr series. The 1st can be found here [0].

This patchset adds skb and xdp type dynptrs, which have two main benefits for
packet parsing:
    * allowing operations on sizes that are not statically known at
      compile-time (eg variable-sized accesses).
    * more ergonomic and less brittle iteration through data (eg does not need
      manual if checking for being within bounds of data_end)

When comparing the differences in runtime for packet parsing without dynptrs
vs. with dynptrs for the more simple cases, there is no noticeable difference.
For the more complex cases where lengths are non-statically known at compile
time, there can be a significant speed-up when using dynptrs (eg a 2x speed up
for cls redirection). Patch 3 contains more details as well as examples of how
to use skb and xdp dynptrs.

[0] https://lore.kernel.org/bpf/20220523210712.3641569-1-joannelkoong@gmail.com/

--
Changelog:

v1 = https://lore.kernel.org/bpf/20220726184706.954822-1-joannelkoong@gmail.com/
v1 -> v2
  * Return data slices to rd-only skb dynptrs (Martin)
  * bpf_dynptr_write allows writes to frags for skb dynptrs, but always
    invalidates associated data slices (Martin)
  * Use switch casing instead of ifs (Andrii)
  * Use 0xFD for experimental kind number in the selftest (Zvi)
  * Put selftest conversions w/ dynptrs into new files (Alexei)
  * Add new selftest "test_cls_redirect_dynptr.c" 


Joanne Koong (3):
  bpf: Add skb dynptrs
  bpf: Add xdp dynptrs
  selftests/bpf: tests for using dynptrs to parse skb and xdp buffers

 include/linux/bpf.h                           |  14 +-
 include/linux/filter.h                        |   7 +
 include/uapi/linux/bpf.h                      |  59 +-
 kernel/bpf/helpers.c                          |  93 +-
 kernel/bpf/verifier.c                         | 105 +-
 net/core/filter.c                             |  99 +-
 tools/include/uapi/linux/bpf.h                |  59 +-
 .../selftests/bpf/prog_tests/cls_redirect.c   |  25 +
 .../testing/selftests/bpf/prog_tests/dynptr.c | 132 ++-
 .../selftests/bpf/prog_tests/l4lb_all.c       |   2 +
 .../bpf/prog_tests/parse_tcp_hdr_opt.c        |  85 ++
 .../selftests/bpf/prog_tests/xdp_attach.c     |   9 +-
 .../testing/selftests/bpf/progs/dynptr_fail.c | 111 ++
 .../selftests/bpf/progs/dynptr_success.c      |  29 +
 .../bpf/progs/test_cls_redirect_dynptr.c      | 968 ++++++++++++++++++
 .../bpf/progs/test_l4lb_noinline_dynptr.c     | 468 +++++++++
 .../bpf/progs/test_parse_tcp_hdr_opt.c        | 119 +++
 .../bpf/progs/test_parse_tcp_hdr_opt_dynptr.c | 110 ++
 .../selftests/bpf/progs/test_xdp_dynptr.c     | 240 +++++
 .../selftests/bpf/test_tcp_hdr_options.h      |   1 +
 20 files changed, 2649 insertions(+), 86 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/parse_tcp_hdr_opt.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_cls_redirect_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_l4lb_noinline_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_parse_tcp_hdr_opt_dynptr.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_dynptr.c

-- 
2.30.2


^ permalink raw reply

* Re: [RFC net-next 4/4] ynl: add a sample user for ethtool
From: Stanislav Fomichev @ 2022-08-11 22:55 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, davem, edumazet, pabeni, jacob.e.keller, vadfed, johannes,
	jiri, dsahern, stephen, fw, linux-doc
In-Reply-To: <20220811123515.4ef1a715@kernel.org>

On Thu, Aug 11, 2022 at 12:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 11 Aug 2022 09:18:03 -0700 sdf@google.com wrote:
> > > +attr-cnt-suffix: CNT
> >
> > Is it a hack to make the generated header fit into existing
> > implementation?
>
> Yup.
>
> > Should we #define ETHTOOL_XXX_CNT ETHTOOL_XXX in
> > the implementation instead? (or s/ETHTOOL_XXX_CNT/ETHTOOL_XXX/ the
> > source itself?)
>
> We could, I guess. To be clear this controls the count, IOW:
>
> enum {
>         PREFIX_A_BLA_ATTR = 1,
>         PREFIX_A_ANOTHER_ATTR,
>         PREFIX_A_AND_ONEMORE,
>         __PFREIX_A_CNT, // <--- This thing
> };
> #define PREFIX_A_MAX (__PFREIX_A_CNT - 1)
>
> It's not used in the generated code, only if we codegen the uAPI,
> AFAIR. So we'd need a way to tell the generator of the uAPI about
> the situation, anyway. I could be misremembering.

My worry is that we'll have more hacks like these and it's hard, as a
spec reader/writer, to figure out that they exist..
So I was wondering if it's "easier" (from the spec reader/writer pov)
to have some c-header-fixup: section where we can have plain
c-preprocessor hacks like these (where we need to redefine something
to match the old behavior).

> > > +attribute-spaces:
> >
> > Are you open to bike shedding? :-)
>
> I can't make promises that I'll change things but I'm curious
> to hear it!
>
> > I like how ethtool_netlink.h calls these 'message types'.
>
> It calls operation types message types, not attr spaces.
> I used ops because that's what genetlink calls things.

Coming from stubby/grpc, I was expecting to see words like
message/field/struct. The question is what's more confusing: sticking
with netlink naming or trying to map grpc/thrift concepts on top of
what we have. (I'm assuming more people know about grpc/thrift than
netlink)

messages: # or maybe 'attribute-sets' ?
  - name: channels
    ...

operations:
  - name: channel_get
    message: channels
    do:
      request:
        fields:
        - header
        - rx_max

Or maybe all we really need is a section in the doc called 'Netlink
for gRPC/Thrift users' where we map these concepts:
- attribute-spaces (attribute-sets?) -> messages
- attributes -> fields

> > > +  -
> > > +    name: header
> > > +    name-prefix: ETHTOOL_A_HEADER_
> >
> > Any issue with name-prefix+name-suffix being non-greppable? Have you tried
> > something like this instead:
> >
> > - name: ETHTOOL_A_HEADER # this is fake, for ynl reference only
> >    attributes:
> >      - name: ETHTOOL_A_HEADER_DEV_INDEX
> >        val:
> >        type:
> >      - name ETHTOOL_A_HEADER_DEV_NAME
> >        ..
> >
> > It seems a bit easier to map the spec into what it's going to produce.
> > For example, it took me a while to translate 'channels_get' below into
> > ETHTOOL_MSG_CHANNELS_GET.
> >
> > Or is it too much ETHTOOL_A_HEADER_?
>
> Dunno, that'd mean that the Python method is called
> ETHTOOL_MSG_CHANNELS_GET rather than just channels_get.
> I don't want to force all languages to use the C naming.
> The C naming just leads to silly copy'n'paste issues like
> f329a0ebeab.

Can we have 'name:' and 'long-name:' or 'c-name:' or 'full-name' ?

- name: header
   attributes:
    - name: dev_index
      full-name: ETHTOOL_A_HEADER_DEV_INDEX
      val:
      type:

Suppose I'm rewriting my c application from uapi to some generated (in
the future) python-like channels_get() method. If I can grep for
ETHTOOL_MSG_CHANNELS_GET, that would save me a bunch of time figuring
out what the new canonical wrapper is.

Also, maybe, at some point we'll have:
- name: dev_index
  c-name: ETHTOOL_A_HEADER_DEV_INDEX
  java-name: headerDevIndex

:-D

> > > +        len: ALTIFNAMSIZ - 1
> >
> > Not sure how strict you want to be here. ALTIFNAMSIZ is defined
> > somewhere else it seems? (IOW, do we want to have implicit dependencies
> > on external/uapi headers?)
>
> Good catch, I'm aware. I was planning to add a "header constants"
> section or some such. A section in "headers" which defines the
> constants which C code will get from the headers.

Define as in 're-define' or define as in 'you need to include some
other header for this to work'?

const:
  - name: ALTIFNAMSIZ
    val: 128

which then does

#ifndef
#define ALTIFNAMSIZ 128
#else
static_assert(ALTIFNAMSIZ == 128)
#endif

?

or:

external-const:
  - name: ALTIFNAMSIZ
    header: include/uapi/linux/if.h

which then might generate the following:

#include <include/uapi/linux/if.h>
#ifndef ALTIFNAMSIZ
#error "include/uapi/linux/if.h does not define ALTIFNAMSIZ"
#endif

?

> For Python it does not matter, as we don't have to size arrays.

Hm, I was expecting the situation to be the opposite :-) Because if
you really have to know this len in python, how do you resolve
ALTIFNAMSIZ?

The simplest thing to do might be to require these headers to be
hermetic (i.e., redefine all consts the spec cares about internally)?

> I was wondering if it will matter for other languages, like Rust?
>
> > > +            - header
> > > +            - rx_count
> > > +            - tx_count
> > > +            - other_count
> > > +            - combined_count
> >
> > My netlink is super rusty, might be worth mentioning in the spec: these
> > are possible attributes, but are all of them required?
>
> Right, will do, nothing is required, or rather requirements are kinda
> hard to express and checked by the code in the kernel.
>
> > You also mention the validation part in the cover letter, do you plan
> > add some of these policy properties to the spec or everything is
> > there already? (I'm assuming we care about the types which we have above and
> > optional/required attribute indication?)
>
> Yeah, my initial plan was to encode requirement in the policy but its
> not trivial. So I left it as future extension. Besides things which are
> required today may not be tomorrow, so its a bit of a strange thing.

That's the hardest part, but it should improve the observability, so
I'm looking forward :-)
As you say, it is probably hard to declaratively define these
requirements at this point for everything, but maybe some parts
(majority?) are doable.

> Regarding policy properties I'm intending to support all of the stuff
> that the kernel policies recognize... but somehow most families I
> converted don't have validation (only mask and length :S).
>
> Actually for DPLL I have a nice validation trick. You can define an
> enum:
>
> constants:
>   -
>     type: flags
>     name: genl_get_flags
>     value-prefix: DPLL_FLAG_
>     values: [ sources, outputs, status ]
>
> Then for an attribute you link it:
>
>       -
>         name: flags
>         type: u32
>         flags-mask: genl_get_flags
>
> And that will auto an enum:
>
> enum dpll_genl_get_flags {
>         DPLL_FLAG_SOURCES = 1,
>         DPLL_FLAG_OUTPUTS = 2,
>         DPLL_FLAG_STATUS = 4,
> };
>
> And a policy with a mask:
>
>         [DPLLA_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x7),

Yeah, that looks nice!

^ permalink raw reply

* Re: [PULL] Networking for 6.0-rc1
From: patchwork-bot+netdevbpf @ 2022-08-11 22:48 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: torvalds, davem, netdev, linux-kernel, pabeni
In-Reply-To: <20220811185102.3253045-1-kuba@kernel.org>

Hello:

This pull request was applied to netdev/net.git (master)
by Linus Torvalds <torvalds@linux-foundation.org>:

On Thu, 11 Aug 2022 11:51:02 -0700 you wrote:
> Hi Linus!
> 
> The following changes since commit f86d1fbbe7858884d6754534a0afbb74fc30bc26:
> 
>   Merge tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next (2022-08-03 16:29:08 -0700)
> 
> are available in the Git repository at:
> 
> [...]

Here is the summary with links:
  - [PULL] Networking for 6.0-rc1
    https://git.kernel.org/netdev/net/c/7ebfc85e2cd7

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCHv3 bpf-next] libbpf: Add names for auxiliary maps
From: patchwork-bot+netdevbpf @ 2022-08-11 22:20 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: netdev, quentin, andrii, ast, daniel, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, bpf
In-Reply-To: <20220811034020.529685-1-liuhangbin@gmail.com>

Hello:

This patch was applied to bpf/bpf-next.git (master)
by Andrii Nakryiko <andrii@kernel.org>:

On Thu, 11 Aug 2022 11:40:20 +0800 you wrote:
> The bpftool self-created maps can appear in final map show output due to
> deferred removal in kernel. These maps don't have a name, which would make
> users confused about where it comes from.
> 
> With a libbpf_ prefix name, users could know who created these maps.
> It also could make some tests (like test_offload.py, which skip base maps
> without names as a workaround) filter them out.
> 
> [...]

Here is the summary with links:
  - [PATCHv3,bpf-next] libbpf: Add names for auxiliary maps
    https://git.kernel.org/bpf/bpf-next/c/10b62d6a38f7

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCHv3 bpf-next] libbpf: Add names for auxiliary maps
From: Andrii Nakryiko @ 2022-08-11 22:15 UTC (permalink / raw)
  To: Hangbin Liu
  Cc: netdev, Quentin Monnet, Andrii Nakryiko, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	bpf
In-Reply-To: <20220811034020.529685-1-liuhangbin@gmail.com>

On Wed, Aug 10, 2022 at 8:40 PM Hangbin Liu <liuhangbin@gmail.com> wrote:
>
> The bpftool self-created maps can appear in final map show output due to
> deferred removal in kernel. These maps don't have a name, which would make
> users confused about where it comes from.
>
> With a libbpf_ prefix name, users could know who created these maps.
> It also could make some tests (like test_offload.py, which skip base maps
> without names as a workaround) filter them out.
>
> Kernel adds bpf prog/map name support in the same merge
> commit fadad670a8ab ("Merge branch 'bpf-extend-info'"). So we can also use
> kernel_supports(NULL, FEAT_PROG_NAME) to check if kernel supports map name.
>
> As disscussed[1], Let's make bpf_map_create accept non-null
> name string, and silently ignore the name if kernel doesn't support.
>
> [1] https://lore.kernel.org/bpf/CAEf4BzYL1TQwo1231s83pjTdFPk9XWWhfZC5=KzkU-VO0k=0Ug@mail.gmail.com/
>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
> v3: let bpf_map_create ignore the name if kernel doesn't support
> v2: rename the wrapper with proper name
> ---

Applied to bpf-next. But let's also do the same for bpf_prog_load().
We can make probe_kern_prog_name() use sys_bpf_prog_load() directly
and avoid calling bpf_prog_load() and thus avoiding circular
dependency.

Will you be able to do this change?


>  tools/lib/bpf/bpf.c    | 2 +-
>  tools/lib/bpf/libbpf.c | 6 +++---
>  2 files changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> index efcc06dafbd9..6a96e665dc5d 100644
> --- a/tools/lib/bpf/bpf.c
> +++ b/tools/lib/bpf/bpf.c
> @@ -183,7 +183,7 @@ int bpf_map_create(enum bpf_map_type map_type,
>                 return libbpf_err(-EINVAL);
>
>         attr.map_type = map_type;
> -       if (map_name)
> +       if (map_name && kernel_supports(NULL, FEAT_PROG_NAME))
>                 libbpf_strlcpy(attr.map_name, map_name, sizeof(attr.map_name));
>         attr.key_size = key_size;
>         attr.value_size = value_size;
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index f7364ea82ac1..a075211b3730 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -4432,7 +4432,7 @@ static int probe_kern_global_data(void)
>         };
>         int ret, map, insn_cnt = ARRAY_SIZE(insns);
>
> -       map = bpf_map_create(BPF_MAP_TYPE_ARRAY, NULL, sizeof(int), 32, 1, NULL);
> +       map = bpf_map_create(BPF_MAP_TYPE_ARRAY, "libbpf_global", sizeof(int), 32, 1, NULL);
>         if (map < 0) {
>                 ret = -errno;
>                 cp = libbpf_strerror_r(ret, errmsg, sizeof(errmsg));
> @@ -4565,7 +4565,7 @@ static int probe_kern_array_mmap(void)
>         LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);
>         int fd;
>
> -       fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, NULL, sizeof(int), sizeof(int), 1, &opts);
> +       fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "libbpf_mmap", sizeof(int), sizeof(int), 1, &opts);
>         return probe_fd(fd);
>  }
>
> @@ -4612,7 +4612,7 @@ static int probe_prog_bind_map(void)
>         };
>         int ret, map, prog, insn_cnt = ARRAY_SIZE(insns);
>
> -       map = bpf_map_create(BPF_MAP_TYPE_ARRAY, NULL, sizeof(int), 32, 1, NULL);
> +       map = bpf_map_create(BPF_MAP_TYPE_ARRAY, "libbpf_det_bind", sizeof(int), 32, 1, NULL);
>         if (map < 0) {
>                 ret = -errno;
>                 cp = libbpf_strerror_r(ret, errmsg, sizeof(errmsg));
> --
> 2.35.3
>

^ permalink raw reply

* Re: [PATCH 4/4] dt-bindings: net: dsa: mediatek,mt7530: update json-schema
From: Arınç ÜNAL @ 2022-08-11 22:09 UTC (permalink / raw)
  To: Krzysztof Kozlowski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Vladimir Oltean, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Rob Herring, Krzysztof Kozlowski,
	Matthias Brugger, Sean Wang, Landen Chao, DENG Qingfang,
	Frank Wunderlich, Luiz Angelo Daros de Luca, Sander Vanheule,
	René van Dorst, Daniel Golle, erkin.bozoglu,
	Sergio Paracuellos
  Cc: netdev, devicetree, linux-arm-kernel, linux-mediatek,
	linux-kernel
In-Reply-To: <e5cf8a19-637c-95cf-1527-11980c73f6c0@linaro.org>

On 2.08.2022 11:46, Krzysztof Kozlowski wrote:
> On 30/07/2022 16:26, Arınç ÜNAL wrote:
>> Update the json-schema for compatible devices.
>>
>> - Define acceptable phy-mode values for CPU port of each compatible device.
>> - Remove requiring the "reg" property since the referred dsa-port.yaml
>> already does that.
>> - Require mediatek,mcm for the described MT7621 SoCs as the compatible
>> string is only used for MT7530 which is a part of the multi-chip module.
> 
> 3 separate patches.

Roger.

> 
>>
>> Signed-off-by: Arınç ÜNAL <arinc.unal@arinc9.com>
>> ---
>>   .../bindings/net/dsa/mediatek,mt7530.yaml     | 220 +++++++++++++++---
>>   1 file changed, 191 insertions(+), 29 deletions(-)
>>
>> diff --git a/Documentation/devicetree/bindings/net/dsa/mediatek,mt7530.yaml b/Documentation/devicetree/bindings/net/dsa/mediatek,mt7530.yaml
>> index a88e650e910b..a37a14fba9f6 100644
>> --- a/Documentation/devicetree/bindings/net/dsa/mediatek,mt7530.yaml
>> +++ b/Documentation/devicetree/bindings/net/dsa/mediatek,mt7530.yaml
>> @@ -135,35 +135,6 @@ properties:
>>         the ethsys.
>>       maxItems: 1
>>   
>> -patternProperties:
>> -  "^(ethernet-)?ports$":
>> -    type: object
> 
> Actually four patches...
> 
> I don't find this change explained in commit msg. What is more, it looks
> incorrect. All properties and patternProperties should be explained in
> top-level part.
> 
> Defining such properties (with big piece of YAML) in each if:then: is no
> readable.

I can't figure out another way. I need to require certain properties for 
a compatible string AND certain enum/const for certain properties which 
are inside patternProperties for "^(ethernet-)?port@[0-9]+$" by reading 
the compatible string.

If I put allOf:if:then under patternProperties, I can't do the latter.

Other than readability to human eyes, binding check works as intended, 
in case there's no other way to do it.

Arınç

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox