Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH 15/22] arch: vdso: consolidate gettime prototypes
From: Vincenzo Frascino @ 2023-11-10 10:16 UTC (permalink / raw)
  To: Arnd Bergmann, Andrew Morton, linux-kernel, Masahiro Yamada,
	linux-kbuild
  Cc: Arnd Bergmann, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Rostedt, Masami Hiramatsu,
	Mark Rutland, Guo Ren, Peter Zijlstra, Ard Biesheuvel,
	Huacai Chen, Greg Ungerer, Michal Simek, Thomas Bogendoerfer,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Geoff Levand, Palmer Dabbelt, Heiko Carstens,
	John Paul Adrian Glaubitz, David S. Miller, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, x86, Helge Deller, Sudip Mukherjee,
	Greg Kroah-Hartman, Timur Tabi, Kent Overstreet, David Woodhouse,
	Naveen N. Rao, Anil S Keshavamurthy, Kees Cook, Juri Lelli,
	Vincent Guittot, Nathan Chancellor, Nick Desaulniers,
	Nicolas Schier, Al Viro, Uwe Kleine-König, linux-alpha,
	linux-snps-arc, linux-arm-kernel, linux-trace-kernel, linux-csky,
	loongarch, linux-m68k, linux-mips, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, netdev, linux-parisc, linux-usb,
	linux-fbdev, dri-devel, linux-bcachefs, linux-mtd
In-Reply-To: <20231108125843.3806765-16-arnd@kernel.org>

Hi Arnd,

On 11/8/23 12:58, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> The VDSO functions are defined as globals in the kernel sources but intended
> to be called from userspace, so there is no need to declare them in a kernel
> side header.
> 
> Without a prototype, this now causes warnings such as
> 
> arch/mips/vdso/vgettimeofday.c:14:5: error: no previous prototype for '__vdso_clock_gettime' [-Werror=missing-prototypes]
> arch/mips/vdso/vgettimeofday.c:28:5: error: no previous prototype for '__vdso_gettimeofday' [-Werror=missing-prototypes]
> arch/mips/vdso/vgettimeofday.c:36:5: error: no previous prototype for '__vdso_clock_getres' [-Werror=missing-prototypes]
> arch/mips/vdso/vgettimeofday.c:42:5: error: no previous prototype for '__vdso_clock_gettime64' [-Werror=missing-prototypes]
> arch/sparc/vdso/vclock_gettime.c:254:1: error: no previous prototype for '__vdso_clock_gettime' [-Werror=missing-prototypes]
> arch/sparc/vdso/vclock_gettime.c:282:1: error: no previous prototype for '__vdso_clock_gettime_stick' [-Werror=missing-prototypes]
> arch/sparc/vdso/vclock_gettime.c:307:1: error: no previous prototype for '__vdso_gettimeofday' [-Werror=missing-prototypes]
> arch/sparc/vdso/vclock_gettime.c:343:1: error: no previous prototype for '__vdso_gettimeofday_stick' [-Werror=missing-prototypes]
> 
> Most architectures have already added workarounds for these by adding
> declarations somewhere, but since these are all compatible, we should
> really just have one copy, with an #ifdef check for the 32-bit vs
> 64-bit variant and use that everywhere.
> 

I agree, it is a good idea to have a single header for this purpose.

> Unfortunately, the sparc version is currently incompatible since
> that never added support for __vdso_clock_gettime64() in 32-bit
> userland. For the moment, I'm leaving this one out, as I can't
> easily test it and it requires a larger rework.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com>

> ---
>  arch/arm/include/asm/vdso.h              |  5 -----
>  arch/arm/vdso/vgettimeofday.c            |  1 +
>  arch/arm64/kernel/vdso32/vgettimeofday.c |  1 +
>  arch/csky/kernel/vdso/vgettimeofday.c    | 11 +----------
>  arch/loongarch/vdso/vgettimeofday.c      |  7 +------
>  arch/mips/vdso/vgettimeofday.c           |  1 +
>  arch/riscv/kernel/vdso/vgettimeofday.c   |  7 +------
>  arch/x86/entry/vdso/vclock_gettime.c     | 10 +---------
>  arch/x86/include/asm/vdso/gettimeofday.h |  2 --
>  arch/x86/um/vdso/um_vdso.c               |  1 +
>  include/vdso/gettime.h                   | 23 +++++++++++++++++++++++
>  11 files changed, 31 insertions(+), 38 deletions(-)
>  create mode 100644 include/vdso/gettime.h
> 
> diff --git a/arch/arm/include/asm/vdso.h b/arch/arm/include/asm/vdso.h
> index 422c3afa806a..5b85889f82ee 100644
> --- a/arch/arm/include/asm/vdso.h
> +++ b/arch/arm/include/asm/vdso.h
> @@ -24,11 +24,6 @@ static inline void arm_install_vdso(struct mm_struct *mm, unsigned long addr)
>  
>  #endif /* CONFIG_VDSO */
>  
> -int __vdso_clock_gettime(clockid_t clock, struct old_timespec32 *ts);
> -int __vdso_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts);
> -int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
> -int __vdso_clock_getres(clockid_t clock_id, struct old_timespec32 *res);
> -
>  #endif /* __ASSEMBLY__ */
>  
>  #endif /* __KERNEL__ */
> diff --git a/arch/arm/vdso/vgettimeofday.c b/arch/arm/vdso/vgettimeofday.c
> index a003beacac76..3554aa35f1ba 100644
> --- a/arch/arm/vdso/vgettimeofday.c
> +++ b/arch/arm/vdso/vgettimeofday.c
> @@ -8,6 +8,7 @@
>  #include <linux/types.h>
>  #include <asm/vdso.h>
>  #include <asm/unwind.h>
> +#include <vdso/gettime.h>
>  
>  int __vdso_clock_gettime(clockid_t clock,
>  			 struct old_timespec32 *ts)
> diff --git a/arch/arm64/kernel/vdso32/vgettimeofday.c b/arch/arm64/kernel/vdso32/vgettimeofday.c
> index 5acff29c5991..e23c7f4ef26b 100644
> --- a/arch/arm64/kernel/vdso32/vgettimeofday.c
> +++ b/arch/arm64/kernel/vdso32/vgettimeofday.c
> @@ -5,6 +5,7 @@
>   * Copyright (C) 2018 ARM Limited
>   *
>   */
> +#include <vdso/gettime.h>
>  
>  int __vdso_clock_gettime(clockid_t clock,
>  			 struct old_timespec32 *ts)
> diff --git a/arch/csky/kernel/vdso/vgettimeofday.c b/arch/csky/kernel/vdso/vgettimeofday.c
> index c4831145eed5..55af30e83752 100644
> --- a/arch/csky/kernel/vdso/vgettimeofday.c
> +++ b/arch/csky/kernel/vdso/vgettimeofday.c
> @@ -2,36 +2,27 @@
>  
>  #include <linux/time.h>
>  #include <linux/types.h>
> +#include <vdso/gettime.h>
>  
>  extern
> -int __vdso_clock_gettime(clockid_t clock,
> -			 struct old_timespec32 *ts);
>  int __vdso_clock_gettime(clockid_t clock,
>  			 struct old_timespec32 *ts)
>  {
>  	return __cvdso_clock_gettime32(clock, ts);
>  }
>  
> -int __vdso_clock_gettime64(clockid_t clock,
> -			   struct __kernel_timespec *ts);
>  int __vdso_clock_gettime64(clockid_t clock,
>  			   struct __kernel_timespec *ts)
>  {
>  	return __cvdso_clock_gettime(clock, ts);
>  }
>  
> -extern
> -int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
> -			struct timezone *tz);
>  int __vdso_gettimeofday(struct __kernel_old_timeval *tv,
>  			struct timezone *tz)
>  {
>  	return __cvdso_gettimeofday(tv, tz);
>  }
>  
> -extern
> -int __vdso_clock_getres(clockid_t clock_id,
> -			struct old_timespec32 *res);
>  int __vdso_clock_getres(clockid_t clock_id,
>  			struct old_timespec32 *res)
>  {
> diff --git a/arch/loongarch/vdso/vgettimeofday.c b/arch/loongarch/vdso/vgettimeofday.c
> index 8f22863bd7ea..0885c1f3a89d 100644
> --- a/arch/loongarch/vdso/vgettimeofday.c
> +++ b/arch/loongarch/vdso/vgettimeofday.c
> @@ -5,23 +5,18 @@
>   * Copyright (C) 2020-2022 Loongson Technology Corporation Limited
>   */
>  #include <linux/types.h>
> +#include <vdso/gettime.h>
>  
> -extern
> -int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
>  int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
>  {
>  	return __cvdso_clock_gettime(clock, ts);
>  }
>  
> -extern
> -int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
>  int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
>  {
>  	return __cvdso_gettimeofday(tv, tz);
>  }
>  
> -extern
> -int __vdso_clock_getres(clockid_t clock_id, struct __kernel_timespec *res);
>  int __vdso_clock_getres(clockid_t clock_id, struct __kernel_timespec *res)
>  {
>  	return __cvdso_clock_getres(clock_id, res);
> diff --git a/arch/mips/vdso/vgettimeofday.c b/arch/mips/vdso/vgettimeofday.c
> index 6b83b6376a4b..604afea3f336 100644
> --- a/arch/mips/vdso/vgettimeofday.c
> +++ b/arch/mips/vdso/vgettimeofday.c
> @@ -9,6 +9,7 @@
>   */
>  #include <linux/time.h>
>  #include <linux/types.h>
> +#include <vdso/gettime.h>
>  
>  #if _MIPS_SIM != _MIPS_SIM_ABI64
>  int __vdso_clock_gettime(clockid_t clock,
> diff --git a/arch/riscv/kernel/vdso/vgettimeofday.c b/arch/riscv/kernel/vdso/vgettimeofday.c
> index cc0d80699c31..b35057802584 100644
> --- a/arch/riscv/kernel/vdso/vgettimeofday.c
> +++ b/arch/riscv/kernel/vdso/vgettimeofday.c
> @@ -8,23 +8,18 @@
>  
>  #include <linux/time.h>
>  #include <linux/types.h>
> +#include <vdso/gettime.h>
>  
> -extern
> -int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
>  int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
>  {
>  	return __cvdso_clock_gettime(clock, ts);
>  }
>  
> -extern
> -int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
>  int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
>  {
>  	return __cvdso_gettimeofday(tv, tz);
>  }
>  
> -extern
> -int __vdso_clock_getres(clockid_t clock_id, struct __kernel_timespec *res);
>  int __vdso_clock_getres(clockid_t clock_id, struct __kernel_timespec *res)
>  {
>  	return __cvdso_clock_getres(clock_id, res);
> diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
> index 7d70935b6758..0debc194bd78 100644
> --- a/arch/x86/entry/vdso/vclock_gettime.c
> +++ b/arch/x86/entry/vdso/vclock_gettime.c
> @@ -11,12 +11,10 @@
>  #include <linux/time.h>
>  #include <linux/kernel.h>
>  #include <linux/types.h>
> +#include <vdso/gettime.h>
>  
>  #include "../../../../lib/vdso/gettimeofday.c"
>  
> -extern int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
> -extern __kernel_old_time_t __vdso_time(__kernel_old_time_t *t);
> -
>  int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz)
>  {
>  	return __cvdso_gettimeofday(tv, tz);
> @@ -35,9 +33,6 @@ __kernel_old_time_t time(__kernel_old_time_t *t)	__attribute__((weak, alias("__v
>  
>  #if defined(CONFIG_X86_64) && !defined(BUILD_VDSO32_64)
>  /* both 64-bit and x32 use these */
> -extern int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
> -extern int __vdso_clock_getres(clockid_t clock, struct __kernel_timespec *res);
> -
>  int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
>  {
>  	return __cvdso_clock_gettime(clock, ts);
> @@ -56,9 +51,6 @@ int clock_getres(clockid_t, struct __kernel_timespec *)
>  
>  #else
>  /* i386 only */
> -extern int __vdso_clock_gettime(clockid_t clock, struct old_timespec32 *ts);
> -extern int __vdso_clock_getres(clockid_t clock, struct old_timespec32 *res);
> -
>  int __vdso_clock_gettime(clockid_t clock, struct old_timespec32 *ts)
>  {
>  	return __cvdso_clock_gettime32(clock, ts);
> diff --git a/arch/x86/include/asm/vdso/gettimeofday.h b/arch/x86/include/asm/vdso/gettimeofday.h
> index c81858d903dc..a46edb0e0cf7 100644
> --- a/arch/x86/include/asm/vdso/gettimeofday.h
> +++ b/arch/x86/include/asm/vdso/gettimeofday.h
> @@ -337,8 +337,6 @@ u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult)
>  }
>  #define vdso_calc_delta vdso_calc_delta
>  
> -int __vdso_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts);
> -
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* __ASM_VDSO_GETTIMEOFDAY_H */
> diff --git a/arch/x86/um/vdso/um_vdso.c b/arch/x86/um/vdso/um_vdso.c
> index ff0f3b4b6c45..63768dd347ce 100644
> --- a/arch/x86/um/vdso/um_vdso.c
> +++ b/arch/x86/um/vdso/um_vdso.c
> @@ -12,6 +12,7 @@
>  #include <linux/time.h>
>  #include <linux/getcpu.h>
>  #include <asm/unistd.h>
> +#include <vdso/gettime.h>
>  
>  int __vdso_clock_gettime(clockid_t clock, struct __kernel_old_timespec *ts)
>  {
> diff --git a/include/vdso/gettime.h b/include/vdso/gettime.h
> new file mode 100644
> index 000000000000..c50d152e7b3e
> --- /dev/null
> +++ b/include/vdso/gettime.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _VDSO_GETTIME_H
> +#define _VDSO_GETTIME_H
> +
> +#include <linux/types.h>
> +
> +struct __kernel_timespec;
> +struct timezone;
> +
> +#if !defined(CONFIG_64BIT) || defined(BUILD_VDSO32_64)
> +struct old_timespec32;
> +int __vdso_clock_getres(clockid_t clock, struct old_timespec32 *res);
> +int __vdso_clock_gettime(clockid_t clock, struct old_timespec32 *ts);
> +#else
> +int __vdso_clock_getres(clockid_t clock, struct __kernel_timespec *res);
> +int __vdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts);
> +#endif
> +
> +__kernel_old_time_t __vdso_time(__kernel_old_time_t *t);
> +int __vdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz);
> +int __vdso_clock_gettime64(clockid_t clock, struct __kernel_timespec *ts);
> +
> +#endif

-- 
Regards,
Vincenzo

^ permalink raw reply

* Re: Missing a write memory barrier in tls_init()
From: Dae R. Jeong @ 2023-11-10 10:22 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: Jakub Kicinski, borisp, john.fastabend, davem, edumazet, pabeni,
	netdev, linux-kernel, ywchoi
In-Reply-To: <ZUtP7lMqFnNK8lw_@hog>

On Wed, Nov 08, 2023 at 10:07:58AM +0100, Sabrina Dubroca wrote:
> 2023-11-07, 18:53:24 -0800, Jakub Kicinski wrote:
> > On Tue, 7 Nov 2023 23:45:46 +0100 Sabrina Dubroca wrote:
> > > Wouldn't it be enough to just move the rcu_assign_pointer after ctx is
> > > fully initialized, ie just before update_sk_prot? also clearer wrt
> > > RCU.
> > 
> > I'm not sure, IIUC rcu_assign_pointer() is equivalent to
> > WRITE_ONCE() on any sane architecture, it depends on address
> > dependencies to provide ordering.
> 
> Not what the doc says:
> 
>     /**
>      * rcu_assign_pointer() - assign to RCU-protected pointer
>      [...]
>      * Inserts memory barriers on architectures that require them
>      * (which is most of them), and also prevents the compiler from
>      * reordering the code that initializes the structure after the pointer
>      * assignment.
>      [...]
>      */
> 
> And it uses smp_store_release (unless writing NULL).
> 

I think Sabrina is right. We can rely on the release semantic implied
in rcu_assign_pointer(). Simply moving rcu_assign_pointer() to the end
of tls_ctx_create() should prevent a scenario what I thought (ie.,
store-store reordering between ctx->sk_proto and sk->sk_prot).

diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 1c2c6800949d..d20b823c68d4 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -816,9 +816,9 @@ struct tls_context *tls_ctx_create(struct sock *sk)
                return NULL;
 
        mutex_init(&ctx->tx_lock);
-       rcu_assign_pointer(icsk->icsk_ulp_data, ctx);
        ctx->sk_proto = READ_ONCE(sk->sk_prot);
        ctx->sk = sk;
+       rcu_assign_pointer(icsk->icsk_ulp_data, ctx);
        return ctx;
 }

But what I also wonder is that, do we need to ensure that
ctx->{tx,rx}_conf is visible before updating sk->sk_prot? If so, as
Sabrina suggested, we may want to move rcu_assign_pointer() right
before update_sk_prot().


Best regards,
Dae R. Jeong

^ permalink raw reply related

* [PATCH] tls: fix missing memory barrier in tls_init
From: Dae R. Jeong @ 2023-11-10 10:57 UTC (permalink / raw)
  To: borisp, john.fastabend, kuba, davem, edumazet, pabeni, netdev,
	linux-kernel
  Cc: ywchoi

In tls_init(), a write memory barrier is missing, and store-store
reordering may cause NULL dereference in tls_{setsockopt,getsockopt}.

CPU0                               CPU1
-----                              -----
// In tls_init()
// In tls_ctx_create()
ctx = kzalloc()
ctx->sk_proto = READ_ONCE(sk->sk_prot) -(1)

// In update_sk_prot()
WRITE_ONCE(sk->sk_prot, tls_prots)     -(2)

                                   // In sock_common_setsockopt()
                                   READ_ONCE(sk->sk_prot)->setsockopt()

                                   // In tls_{setsockopt,getsockopt}()
                                   ctx->sk_proto->setsockopt()    -(3)

In the above scenario, when (1) and (2) are reordered, (3) can observe
the NULL value of ctx->sk_proto, causing NULL dereference.

To fix it, we rely on rcu_assign_pointer() which implies the release
barrier semantic. By moving rcu_assign_pointer() after ctx is fully
initialized, we can ensure that all fields of ctx are visible when
changing sk->sk_prot.

Also, as Sabrina suggested, this patch gets rid of tls_ctx_create(),
and move all that into tls_init().

Signed-off-by: Dae R. Jeong <threeearcat@gmail.com>
---
 net/tls/tls_main.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 1c2c6800949d..235fa93dc7ef 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -806,22 +806,6 @@ static int tls_setsockopt(struct sock *sk, int level, int optname,
 	return do_tls_setsockopt(sk, optname, optval, optlen);
 }
 
-struct tls_context *tls_ctx_create(struct sock *sk)
-{
-	struct inet_connection_sock *icsk = inet_csk(sk);
-	struct tls_context *ctx;
-
-	ctx = kzalloc(sizeof(*ctx), GFP_ATOMIC);
-	if (!ctx)
-		return NULL;
-
-	mutex_init(&ctx->tx_lock);
-	rcu_assign_pointer(icsk->icsk_ulp_data, ctx);
-	ctx->sk_proto = READ_ONCE(sk->sk_prot);
-	ctx->sk = sk;
-	return ctx;
-}
-
 static void build_proto_ops(struct proto_ops ops[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
 			    const struct proto_ops *base)
 {
@@ -933,6 +917,7 @@ static void build_protos(struct proto prot[TLS_NUM_CONFIG][TLS_NUM_CONFIG],
 
 static int tls_init(struct sock *sk)
 {
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tls_context *ctx;
 	int rc = 0;
 
@@ -954,14 +939,27 @@ static int tls_init(struct sock *sk)
 
 	/* allocate tls context */
 	write_lock_bh(&sk->sk_callback_lock);
-	ctx = tls_ctx_create(sk);
+	ctx = kzalloc(sizeof(*ctx), GFP_ATOMIC);
 	if (!ctx) {
 		rc = -ENOMEM;
 		goto out;
 	}
 
+	mutex_init(&ctx->tx_lock);
+	ctx->sk_proto = READ_ONCE(sk->sk_prot);
+	ctx->sk = sk;
 	ctx->tx_conf = TLS_BASE;
 	ctx->rx_conf = TLS_BASE;
+	/* rcu_assign_pointer() should be called after initialization of
+	 * all fields of ctx. It ensures that all fields of ctx are
+	 * visible before changing sk->sk_prot, and prevents reading of
+	 * uninitialized fields in tls_{getsockopt,setsockopt}. Note that
+	 * we do not need a read barrier in tls_{getsockopt,setsockopt} as
+	 * there is an address dependency between
+	 * sk->sk_proto->{getsockopt,setsockopt} and ctx->sk_proto.
+	 */
+	rcu_assign_pointer(icsk->icsk_ulp_data, ctx);
+
 	update_sk_prot(sk, ctx);
 out:
 	write_unlock_bh(&sk->sk_callback_lock);
-- 
2.42.1


^ permalink raw reply related

* Re: [PATCH net] ipvlan: add ipvlan_route_v6_outbound() helper
From: patchwork-bot+netdevbpf @ 2023-11-10 11:00 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, netdev, eric.dumazet, syzkaller, maheshb,
	willemb
In-Reply-To: <20231109152241.3754521-1-edumazet@google.com>

Hello:

This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:

On Thu,  9 Nov 2023 15:22:41 +0000 you wrote:
> Inspired by syzbot reports using a stack of multiple ipvlan devices.
> 
> Reduce stack size needed in ipvlan_process_v6_outbound() by moving
> the flowi6 struct used for the route lookup in an non inlined
> helper. ipvlan_route_v6_outbound() needs 120 bytes on the stack,
> immediately reclaimed.
> 
> [...]

Here is the summary with links:
  - [net] ipvlan: add ipvlan_route_v6_outbound() helper
    https://git.kernel.org/netdev/net/c/18f039428c7d

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH v2] MAINTAINERS: net: Update reviewers for TI's Ethernet drivers
From: patchwork-bot+netdevbpf @ 2023-11-10 11:00 UTC (permalink / raw)
  To: Ravi Gunasekaran
  Cc: netdev, linux-omap, linux-kernel, s-vadapalli, nm, srk, rogerq
In-Reply-To: <20231110092749.3618-1-r-gunasekaran@ti.com>

Hello:

This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:

On Fri, 10 Nov 2023 14:57:49 +0530 you wrote:
> Grygorii is no longer associated with TI and messages addressed to
> him bounce.
> 
> Add Siddharth, Roger and myself as reviewers.
> 
> Signed-off-by: Ravi Gunasekaran <r-gunasekaran@ti.com>
> 
> [...]

Here is the summary with links:
  - [v2] MAINTAINERS: net: Update reviewers for TI's Ethernet drivers
    https://git.kernel.org/netdev/net/c/cbe9e68e1e0f

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: Missing a write memory barrier in tls_init()
From: Dae R. Jeong @ 2023-11-10 11:04 UTC (permalink / raw)
  To: Sabrina Dubroca
  Cc: Jakub Kicinski, borisp, john.fastabend, davem, edumazet, pabeni,
	netdev, linux-kernel, ywchoi
In-Reply-To: <ZU4Ecx2qbdqGfRVw@dragonet>

On Fri, Nov 10, 2023 at 07:22:48PM +0900, Dae R. Jeong wrote:
> On Wed, Nov 08, 2023 at 10:07:58AM +0100, Sabrina Dubroca wrote:
> > 2023-11-07, 18:53:24 -0800, Jakub Kicinski wrote:
> > > On Tue, 7 Nov 2023 23:45:46 +0100 Sabrina Dubroca wrote:
> > > > Wouldn't it be enough to just move the rcu_assign_pointer after ctx is
> > > > fully initialized, ie just before update_sk_prot? also clearer wrt
> > > > RCU.
> > > 
> > > I'm not sure, IIUC rcu_assign_pointer() is equivalent to
> > > WRITE_ONCE() on any sane architecture, it depends on address
> > > dependencies to provide ordering.
> > 
> > Not what the doc says:
> > 
> >     /**
> >      * rcu_assign_pointer() - assign to RCU-protected pointer
> >      [...]
> >      * Inserts memory barriers on architectures that require them
> >      * (which is most of them), and also prevents the compiler from
> >      * reordering the code that initializes the structure after the pointer
> >      * assignment.
> >      [...]
> >      */
> > 
> > And it uses smp_store_release (unless writing NULL).
> > 
> 
> I think Sabrina is right. We can rely on the release semantic implied
> in rcu_assign_pointer(). Simply moving rcu_assign_pointer() to the end
> of tls_ctx_create() should prevent a scenario what I thought (ie.,
> store-store reordering between ctx->sk_proto and sk->sk_prot).
> 
> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index 1c2c6800949d..d20b823c68d4 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -816,9 +816,9 @@ struct tls_context *tls_ctx_create(struct sock *sk)
>                 return NULL;
>  
>         mutex_init(&ctx->tx_lock);
> -       rcu_assign_pointer(icsk->icsk_ulp_data, ctx);
>         ctx->sk_proto = READ_ONCE(sk->sk_prot);
>         ctx->sk = sk;
> +       rcu_assign_pointer(icsk->icsk_ulp_data, ctx);
>         return ctx;
>  }
> 
> But what I also wonder is that, do we need to ensure that
> ctx->{tx,rx}_conf is visible before updating sk->sk_prot? If so, as
> Sabrina suggested, we may want to move rcu_assign_pointer() right
> before update_sk_prot().
> 
> 
> Best regards,
> Dae R. Jeong

I sent a patch by taking suggestions from Sabrina. The patches 1)
moves rcu_assign_pointer() after fully initializing ctx, and 2) gets
rid of tls_ctx_create().

I'm not sure whether removing tls_ctx_create() is a good idea or not,
but it still did not fully initialize ctx (i.e., ctx->{tx,rx}_conf).

Let me know if there is any issue, then I will rewrite a patch.


Best regards,
Dae R. Jeong

^ permalink raw reply

* Re: [PATCH v2 net 0/7] qbv cycle time extension/truncation
From: Abdul Rahim, Faizal @ 2023-11-10 11:06 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Vinicius Costa Gomes, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	David S . Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	netdev, linux-kernel
In-Reply-To: <20231108155144.xadpltcdw2rhdpkv@skbuf>



On 8/11/2023 11:51 pm, Vladimir Oltean wrote:
> Hi Faizal,
> 
> On Tue, Nov 07, 2023 at 06:20:16AM -0500, Faizal Rahim wrote:
>> According to IEEE Std. 802.1Q-2018 section Q.5 CycleTimeExtension,
>> the Cycle Time Extension variable allows this extension of the last old
>> cycle to be done in a defined way. If the last complete old cycle would
>> normally end less than OperCycleTimeExtension nanoseconds before the new
>> base time, then the last complete cycle before AdminBaseTime is reached
>> is extended so that it ends at AdminBaseTime.
>>
>> Changes in v2:
>>
>> - Added 's64 cycle_time_correction' in 'sched_gate_list struct'.
>> - Removed sched_changed created in v1 since the new cycle_time_correction
>>    field can also serve to indicate the need for a schedule change.
>> - Added 'bool correction_active' in 'struct sched_entry' to represent
>>    the correction state from the entry's perspective and return corrected
>>    interval value when active.
>> - Fix cycle time correction logics for the next entry in advance_sched()
>> - Fix and implement proper cycle time correction logics for current
>>    entry in taprio_start_sched()
>>
>> v1 at:
>> https://lore.kernel.org/lkml/20230530082541.495-1-muhammad.husaini.zulkifli@intel.com/
> 
> I like what came of this patch series. Thanks for following up and
> taking over. I have some comments on individual patches.
> 

Hi Vladimir,

Thanks a bunch for your review and your patience with some of my basic 
mistakes.
Appreciate the time and effort you put into it.
I'll take a bit to double-check the code and retest some stuff.

Will loop back with you soon.

Cheers.

^ permalink raw reply

* Re: [PATCH net] net: sched: fix warn on htb offloaded class creation
From: Maxim Mikityanskiy @ 2023-11-10 11:07 UTC (permalink / raw)
  To: Chittim, Madhu
  Cc: Paolo Abeni, netdev, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Tariq Toukan,
	Gal Pressman, Saeed Mahameed, xuejun.zhang, sridhar.samudrala
In-Reply-To: <5d873c14-9d17-4c48-8e11-951b99270b75@intel.com>

On Thu, 09 Nov 2023 at 13:54:17 -0800, Chittim, Madhu wrote:
> 
> 
> On 10/31/2023 7:40 AM, Maxim Mikityanskiy wrote:
> > On Tue, 31 Oct 2023 at 10:11:14 +0100, Paolo Abeni wrote:
> > > Hi,
> > > 
> > > I'm sorry for the late reply.
> > > 
> > > On Fri, 2023-10-27 at 16:57 +0300, Maxim Mikityanskiy wrote:
> > > > I believe this is not the right fix.
> > > > 
> > > > On Thu, 26 Oct 2023 at 17:36:48 +0200, Paolo Abeni wrote:
> > > > > The following commands:
> > > > > 
> > > > > tc qdisc add dev eth1 handle 2: root htb offload
> > > > > tc class add dev eth1 parent 2: classid 2:1 htb rate 5mbit burst 15k
> > > > > 
> > > > > yeld to a WARN in the HTB qdisc:
> > > > 
> > > > Something is off here. These are literally the most basic commands one
> > > > could invoke with HTB offload, I'm sure they worked. Is it something
> > > > that broke recently? Tariq/Gal/Saeed, could you check them on a Mellanox
> > > > NIC?
> > > > 
> > > > > 
> > > > >   WARNING: CPU: 2 PID: 1583 at net/sched/sch_htb.c:1959
> > > > >   CPU: 2 PID: 1583 Comm: tc Kdump: loaded 6.6.0-rc2.mptcp_7895773e5235+ #59
> > > > >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc37 04/01/2014
> > > > >   RIP: 0010:htb_change_class+0x25c4/0x2e30 [sch_htb]
> > > > >   Code: 24 58 48 b8 00 00 00 00 00 fc ff df 48 89 ca 48 c1 ea 03 80 3c 02 00 0f 85 92 01 00 00 49 89 8c 24 b0 01 00 00 e9 77 fc ff ff <0f> 0b e9 15 ec ff ff 80 3d f8 35 00 00 00 0f 85 d4 f9 ff ff ba 32
> > > > >   RSP: 0018:ffffc900015df240 EFLAGS: 00010246
> > > > >   RAX: 0000000000000000 RBX: ffff88811b4ca000 RCX: ffff88811db42800
> > > > >   RDX: 1ffff11023b68502 RSI: ffffffffaf2e6a00 RDI: ffff88811db42810
> > > > >   RBP: ffff88811db45000 R08: 0000000000000001 R09: fffffbfff664bbc9
> > > > >   R10: ffffffffb325de4f R11: ffffffffb2d33748 R12: 0000000000000000
> > > > >   R13: ffff88811db43000 R14: ffff88811b4caaac R15: ffff8881252c0030
> > > > >   FS:  00007f6c1f126740(0000) GS:ffff88815aa00000(0000) knlGS:0000000000000000
> > > > >   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > >   CR2: 000055dca8e5b4a8 CR3: 000000011bc7a006 CR4: 0000000000370ee0
> > > > >   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > >   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > >   Call Trace:
> > > > >   <TASK>
> > > > >    tc_ctl_tclass+0x394/0xeb0
> > > > >    rtnetlink_rcv_msg+0x2f5/0xaa0
> > > > >    netlink_rcv_skb+0x12e/0x3a0
> > > > >    netlink_unicast+0x421/0x730
> > > > >    netlink_sendmsg+0x79e/0xc60
> > > > >    ____sys_sendmsg+0x95a/0xc20
> > > > >    ___sys_sendmsg+0xee/0x170
> > > > >    __sys_sendmsg+0xc6/0x170
> > > > >   do_syscall_64+0x58/0x80
> > > > >   entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> > > > > 
> > > > > The first command creates per TX queue pfifo qdiscs in
> > > > > tc_modify_qdisc() -> htb_init() and grafts the pfifo to each dev_queue
> > > > > via tc_modify_qdisc() ->  qdisc_graft() -> htb_attach().
> > > > 
> > > > Not exactly; it grafts pfifo to direct queues only. htb_attach_offload
> > > > explicitly grafts noop to all the remaining queues.
> > > 
> > > num_direct_qdiscs == real_num_tx_queues:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/net/sched/sch_htb.c#L1101
> > > 
> > > pfifo will be configured on all the TX queues available at TC creation
> > > time, right?
> > 
> > Yes, all real TX queues will be used as direct queues (for unclassified
> > traffic). num_tx_queues should be somewhat bigger than
> > real_num_tx_queues - it should reserve a queue per potential leaf class.
> > 
> > pfifo is configured on direct queues, and the reserved queues have noop.
> > Then, when a new leaf class is added (TC_HTB_LEAF_ALLOC_QUEUE), the
> > driver allocates a new queue and increases real_num_tx_queues. HTB
> > assigns a pfifo qdisc to the newly allocated queue.
> > 
> > Changing the hierarchy (deleting a node or converting an inner node to a
> > leaf) may reorder the classful queues (with indexes >= the initial
> > real_num_tx_queues), so that there are no gaps.
> > 
> > > Lacking a mlx card with offload support I hack basic htb support in
> > > netdevsim and I observe the splat on top of such device. I can as well
> > > share the netdevsim patch - it will need some clean-up.
> > 
> > I will be happy to review the netdevsim patch, but I don't promise
> > prompt responsiveness.
> > 
> > > > 
> > > > > When the command completes, the qdisc_sleeping for each dev_queue is a
> > > > > pfifo one. The next class creation will trigger the reported splat.
> > > > > 
> > > > > Address the issue taking care of old non-builtin qdisc in
> > > > > htb_change_class().
> > > > > 
> > > > > Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
> > > > > Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> > > > > ---
> > > > >   net/sched/sch_htb.c | 3 +--
> > > > >   1 file changed, 1 insertion(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> > > > > index 0d947414e616..dc682bd542b4 100644
> > > > > --- a/net/sched/sch_htb.c
> > > > > +++ b/net/sched/sch_htb.c
> > > > > @@ -1955,8 +1955,7 @@ static int htb_change_class(struct Qdisc *sch, u32 classid,
> > > > >   				qdisc_refcount_inc(new_q);
> > > > >   			}
> > > > >   			old_q = htb_graft_helper(dev_queue, new_q);
> > > > > -			/* No qdisc_put needed. */
> > > > > -			WARN_ON(!(old_q->flags & TCQ_F_BUILTIN));
> > > > > +			qdisc_put(old_q);
> > > > 
> > > > We can get here after one of two cases above:
> > > > 
> > > > 1. A new queue is allocated with TC_HTB_LEAF_ALLOC_QUEUE. It's supposed
> > > > to have a noop qdisc by default (after htb_attach_offload).
> > > 
> > > So most likely the trivial netdevsim implementation I used was not good
> > > enough.
> > > 
> > > Which constrains should respect TC_HTB_LEAF_ALLOC_QUEUE WRT the
> > > returned qid value? should it in the (real_num_tx_queues,
> > > num_tx_queues] range?
> > 
> > Let's say N is real_num_tx_queues as it was at the moment of attaching.
> > HTB queues should be allocated from [N, num_tx_queues), and
> > real_num_tx_queues should be increased accordingly. It should not return
> > queues number [0, N).
> > 
> > Deletions should fill the gaps: if queue X is being deleted, N <= X <
> > real_num_tx_queues - 1, then the gap should be filled with queue number
> > real_num_tx_queues - 1 by swapping the queues (real_num_tx_queues will
> > be decreased by 1 accordingly). Some care also needs to be taken when
> > converting inner-to-leaf (TC_HTB_LEAF_DEL_LAST) and leaf-to-inner (it's
> > better to get insights from [1], there are also some comments).
> > 
> > > Can HTB actually configure H/W shaping on
> > > real_num_tx_queues?
> > 
> > It will be on real_num_tx_queues, but after it's increased to add new
> > HTB queues. The original queues [0, N) are used for direct traffic, same
> > as the non-offloaded HTB's direct_queue (it's not shaped).
> > 
> > > I find no clear documentation WRT the above.
> > 
> > I'm sorry for the lack of documentation. All I have is the commit
> > message [2] and a netdev talk [3]. Maybe the slides could be of some
> > use...
> > 
> > I hope the above explanation clarifies something, and feel free to ask
> > further questions, I'll be glad to explain what hasn't been documented
> > properly.
> 
> We would like to enable Tx rate limiting using htb offload on all the
> existing queues.

I don't seem to understand how you see it possible with HTB.

1. Where would the unclassified traffic go? HTB uses a set of filters
that steer certain traffic to certain classes. Everything that doesn't
match the filter goes to the direct queue, which doesn't have shaping.
In the non-offloaded HTB it's struct htb_sched::direct_queue. With HTB
offload it's mapped to the standard set of queues.

2. How do you see the mapping of HTB classes to netdev queues if you use
a fixed set of queues? In the current implementation of HTB offload,
each new HTB class creates a new queue. If you are bound to using only N
standard queues, how would the allocation work when you add the second,
third, etc. HTB classes?

> We are able to do with the following set of commands with
> Paolo's patch
> 
> tc qdisc add dev enp175s0v0 handle 1: root htb offload
> tc class add dev enp175s0v0 parent 1: classid 1:1 htb rate 1000mbit ceil
> 2000mbit burst 100k

As long as you don't attach any filters, this set of commands is not
supposed to shape anything. It should TX full-speed, you should be able
to see this effect if you configure non-offloaded HTB. As long as there
are no filters, and the default classid doesn't match your only class,
all traffic will go to the direct queue.

What is your goal? If you need shaping for the whole netdev, maybe HTB
is not needed, and it's enough to attach a catchall filter with the
police action? Or use /sys/class/net/eth0/queues/tx-*/tx_maxrate
per-queue? Or tc mqprio mode channel, that allows to group existing
queues and assign rate limits?

> where
>   classid 1:1 is tx queue0
>   tx_minrate is rate 1000mbps
>   tx_maxrate is ceil 2000mbps
> 
> 
> In order to not break your implementation could bring in if condition
> instead WARN_ON, something like below
> 	if (!(old_q->flags & TCQ_F_BUILTIN))
> 		qdisc_put(old_q);
> 
> Would this work for you, please advise.

One of the reasons this WARN is there to prevent misuse of the API (the
other is to catch bugs that lead to inconsistencies in the internal
state). We can't just remove the WARN and start misusing the API. Your
ideas require more major design changes, but before that I need to
understand your approach to solving (1) and (2).

^ permalink raw reply

* Re: [PATCH net] tty: Fix uninit-value access in ppp_sync_receive()
From: patchwork-bot+netdevbpf @ 2023-11-10 11:10 UTC (permalink / raw)
  To: Shigeru Yoshida
  Cc: davem, edumazet, kuba, pabeni, linux-ppp, netdev, linux-kernel
In-Reply-To: <20231108154420.1474853-1-syoshida@redhat.com>

Hello:

This patch was applied to netdev/net.git (main)
by David S. Miller <davem@davemloft.net>:

On Thu,  9 Nov 2023 00:44:20 +0900 you wrote:
> KMSAN reported the following uninit-value access issue:
> 
> =====================================================
> BUG: KMSAN: uninit-value in ppp_sync_input drivers/net/ppp/ppp_synctty.c:690 [inline]
> BUG: KMSAN: uninit-value in ppp_sync_receive+0xdc9/0xe70 drivers/net/ppp/ppp_synctty.c:334
>  ppp_sync_input drivers/net/ppp/ppp_synctty.c:690 [inline]
>  ppp_sync_receive+0xdc9/0xe70 drivers/net/ppp/ppp_synctty.c:334
>  tiocsti+0x328/0x450 drivers/tty/tty_io.c:2295
>  tty_ioctl+0x808/0x1920 drivers/tty/tty_io.c:2694
>  vfs_ioctl fs/ioctl.c:51 [inline]
>  __do_sys_ioctl fs/ioctl.c:871 [inline]
>  __se_sys_ioctl+0x211/0x400 fs/ioctl.c:857
>  __x64_sys_ioctl+0x97/0xe0 fs/ioctl.c:857
>  do_syscall_x64 arch/x86/entry/common.c:51 [inline]
>  do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
>  entry_SYSCALL_64_after_hwframe+0x63/0x6b
> 
> [...]

Here is the summary with links:
  - [net] tty: Fix uninit-value access in ppp_sync_receive()
    https://git.kernel.org/netdev/net/c/719639853d88

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* RE: [PATCH net 2/3] dpll: fix pin dump crash for rebound module
From: Kubalewski, Arkadiusz @ 2023-11-10 11:18 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev@vger.kernel.org, vadim.fedorenko@linux.dev,
	Michalik, Michal, Olech, Milena, pabeni@redhat.com,
	kuba@kernel.org
In-Reply-To: <ZU4AoNqxo6j5zy+V@nanopsycho>

>From: Jiri Pirko <jiri@resnulli.us>
>Sent: Friday, November 10, 2023 11:06 AM
>
>Fri, Nov 10, 2023 at 10:01:50AM CET, arkadiusz.kubalewski@intel.com wrote:
>>>From: Jiri Pirko <jiri@resnulli.us>
>>>Sent: Friday, November 10, 2023 7:46 AM
>>>
>>>Fri, Nov 10, 2023 at 12:32:21AM CET, arkadiusz.kubalewski@intel.com
>>>wrote:
>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>Sent: Thursday, November 9, 2023 7:06 PM
>>>>>
>>>>>Thu, Nov 09, 2023 at 05:30:20PM CET, arkadiusz.kubalewski@intel.com
>>>>>wrote:
>>>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>Sent: Thursday, November 9, 2023 2:19 PM
>>>>>>>
>>>>>>>Thu, Nov 09, 2023 at 01:20:48PM CET, arkadiusz.kubalewski@intel.com
>>>>>>>wrote:
>>>>>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>>>Sent: Wednesday, November 8, 2023 3:30 PM
>>>>>>>>>
>>>>>>>>>Wed, Nov 08, 2023 at 11:32:25AM CET, arkadiusz.kubalewski@intel.com
>>>>>>>>>wrote:
>>>>>>>>>>When a kernel module is unbound but the pin resources were not
>>>>>>>>>>entirely
>>>>>>>>>>freed (other kernel module instance have had kept the reference to
>>>>>>>>>>that
>>>>>>>>>>pin), and kernel module is again bound, the pin properties would
>>>>>>>>>>not
>>>>>>>>>>be
>>>>>>>>>>updated (the properties are only assigned when memory for the pin
>>>>>>>>>>is
>>>>>>>>>>allocated), prop pointer still points to the kernel module memory
>>>>>>>>>>of
>>>>>>>>>>the kernel module which was deallocated on the unbind.
>>>>>>>>>>
>>>>>>>>>>If the pin dump is invoked in this state, the result is a kernel
>>>>>>>>>>crash.
>>>>>>>>>>Prevent the crash by storing persistent pin properties in dpll
>>>>>>>>>>subsystem,
>>>>>>>>>>copy the content from the kernel module when pin is allocated,
>>>>>>>>>>instead
>>>>>>>>>>of
>>>>>>>>>>using memory of the kernel module.
>>>>>>>>>>
>>>>>>>>>>Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base
>>>>>>>>>>functions")
>>>>>>>>>>Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base
>>>>>>>>>>functions")
>>>>>>>>>>Signed-off-by: Arkadiusz Kubalewski
>>>>>>>>>><arkadiusz.kubalewski@intel.com>
>>>>>>>>>>---
>>>>>>>>>> drivers/dpll/dpll_core.c    |  4 ++--
>>>>>>>>>> drivers/dpll/dpll_core.h    |  4 ++--
>>>>>>>>>> drivers/dpll/dpll_netlink.c | 28 ++++++++++++++--------------
>>>>>>>>>> 3 files changed, 18 insertions(+), 18 deletions(-)
>>>>>>>>>>
>>>>>>>>>>diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c
>>>>>>>>>>index 3568149b9562..4077b562ba3b 100644
>>>>>>>>>>--- a/drivers/dpll/dpll_core.c
>>>>>>>>>>+++ b/drivers/dpll/dpll_core.c
>>>>>>>>>>@@ -442,7 +442,7 @@ dpll_pin_alloc(u64 clock_id, u32 pin_idx,
>>>>>>>>>>struct
>>>>>>>>>>module *module,
>>>>>>>>>> 		ret = -EINVAL;
>>>>>>>>>> 		goto err;
>>>>>>>>>> 	}
>>>>>>>>>>-	pin->prop = prop;
>>>>>>>>>>+	memcpy(&pin->prop, prop, sizeof(pin->prop));
>>>>>>>>>
>>>>>>>>>Odd, you don't care about the pointer within this structure?
>>>>>>>>>
>>>>>>>>
>>>>>>>>Well, true. Need a fix.
>>>>>>>>Wondering if copying idea is better than just assigning prop pointer
>>>>>>>>on
>>>>>>>>each call to dpll_pin_get(..) function (when pin already exists)?
>>>>>>>
>>>>>>>Not sure what do you mean. Examples please.
>>>>>>>
>>>>>>
>>>>>>Sure,
>>>>>>
>>>>>>Basically this change:
>>>>>>
>>>>>>diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c
>>>>>>index ae884b92d68c..06b72d5877c3 100644
>>>>>>--- a/drivers/dpll/dpll_core.c
>>>>>>+++ b/drivers/dpll/dpll_core.c
>>>>>>@@ -483,6 +483,7 @@ dpll_pin_get(u64 clock_id, u32 pin_idx, struct
>>>>>>module
>>>>>>*module,
>>>>>>                    pos->pin_idx == pin_idx &&
>>>>>>                    pos->module == module) {
>>>>>>                        ret = pos;
>>>>>>+                       pos->prop = prop;
>>>>>>                        refcount_inc(&ret->refcount);
>>>>>>                        break;
>>>>>>                }
>>>>>>
>>>>>>would replace whole of this patch changes, although seems a bit hacky.
>>>>>
>>>>>Or event better, as I suggested in the other patch reply, resolve this
>>>>>internally in the driver registering things only when they are valid.
>>>>>Much better then to hack anything in dpll core.
>>>>>
>>>>
>>>>This approach seemed to me hacky, that is why started with coping the
>>>>data.
>>>>It is not about registering, rather about unregistering on driver
>>>>unbind, which brakes things, and currently cannot be recovered in
>>>>described case.
>>>
>>>Sure it can. PF0 unbind-> internal notification-> PF1 unregisters all
>>>related object. Very clean and simple.
>>>
>>
>>What you are suggesting is:
>>- special purpose bus in the driver,
>
>No, it is a simple notificator. Very common infra over the whole
>kernel code.
>

No difference if this is simple or not, it is special purpose.

>
>>- dpll-related,
>
>Is this the only thing that PF0 is special with? Perhaps you can
>utilize this for other features as well, since your fw design is like
>this.
>

None api user is allowed to the unordered driver bind when there are muxed
pins involved.
This requires a fix in the api not in the driver.

>
>>- not needed,
>>- prone for errors.
>>
>>The dpll subsystem is here to make driver life easier.
>
>No, the subsystem is never here to handle device specific issues. And
>your PF0 dependency is very clearly something device specific. Don't
>pollute the dpll subsystem with workaround to handle specific device
>needs. create/register the dplls objects from your driver only when it
>is valid to do so. Make sure the lifetime of such object stays in
>the scope of validity. Handle that in the driver. Very clear and simple.
>

In our case this is PF0 but this is broader issue. The muxed pins
infrastructure in now slightly broken and I am fixing it.
From the beginning it was designed to allow separated driver instances
to create a device and connect their pins with it.
Any driver which would use it would face this issue. What you are trying
to imply is that it is better to put traffic lights on each car instead
of putting them on crossroads.

>Thanks!
>

Thank you!
Arkadiusz

>
>>
>>Thank you!
>>Arkadiusz
>>
>>>
>>>>
>>>>Thank you!
>>>>Arkadiusz
>>>>
>>>>>
>>>>>>
>>>>>>Thank you!
>>>>>>Arkadiusz
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>Thank you!
>>>>>>>>Arkadiusz
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 	refcount_set(&pin->refcount, 1);
>>>>>>>>>> 	xa_init_flags(&pin->dpll_refs, XA_FLAGS_ALLOC);
>>>>>>>>>> 	xa_init_flags(&pin->parent_refs, XA_FLAGS_ALLOC);
>>>>>>>>>>@@ -634,7 +634,7 @@ int dpll_pin_on_pin_register(struct dpll_pin
>>>>>>>>>>*parent,
>>>>>>>>>>struct dpll_pin *pin,
>>>>>>>>>> 	unsigned long i, stop;
>>>>>>>>>> 	int ret;
>>>>>>>>>>
>>>>>>>>>>-	if (WARN_ON(parent->prop->type != DPLL_PIN_TYPE_MUX))
>>>>>>>>>>+	if (WARN_ON(parent->prop.type != DPLL_PIN_TYPE_MUX))
>>>>>>>>>> 		return -EINVAL;
>>>>>>>>>>
>>>>>>>>>> 	if (WARN_ON(!ops) ||
>>>>>>>>>>diff --git a/drivers/dpll/dpll_core.h b/drivers/dpll/dpll_core.h
>>>>>>>>>>index 5585873c5c1b..717f715015c7 100644
>>>>>>>>>>--- a/drivers/dpll/dpll_core.h
>>>>>>>>>>+++ b/drivers/dpll/dpll_core.h
>>>>>>>>>>@@ -44,7 +44,7 @@ struct dpll_device {
>>>>>>>>>>  * @module:		module of creator
>>>>>>>>>>  * @dpll_refs:		hold referencees to dplls pin was
>>>>>>>>>>registered
>>>>>>>>>>with
>>>>>>>>>>  * @parent_refs:	hold references to parent pins pin was
>>>>>>>>>>registered
>>>>>>>>>>with
>>>>>>>>>>- * @prop:		pointer to pin properties given by registerer
>>>>>>>>>>+ * @prop:		pin properties copied from the registerer
>>>>>>>>>>  * @rclk_dev_name:	holds name of device when pin can recover
>>>>>>>>>>clock
>>>>>>>>>>from it
>>>>>>>>>>  * @refcount:		refcount
>>>>>>>>>>  **/
>>>>>>>>>>@@ -55,7 +55,7 @@ struct dpll_pin {
>>>>>>>>>> 	struct module *module;
>>>>>>>>>> 	struct xarray dpll_refs;
>>>>>>>>>> 	struct xarray parent_refs;
>>>>>>>>>>-	const struct dpll_pin_properties *prop;
>>>>>>>>>>+	struct dpll_pin_properties prop;
>>>>>>>>>> 	refcount_t refcount;
>>>>>>>>>> };
>>>>>>>>>>
>>>>>>>>>>diff --git a/drivers/dpll/dpll_netlink.c
>>>>>>>>>>b/drivers/dpll/dpll_netlink.c
>>>>>>>>>>index 93fc6c4b8a78..963bbbbe6660 100644
>>>>>>>>>>--- a/drivers/dpll/dpll_netlink.c
>>>>>>>>>>+++ b/drivers/dpll/dpll_netlink.c
>>>>>>>>>>@@ -278,17 +278,17 @@ dpll_msg_add_pin_freq(struct sk_buff *msg,
>>>>>>>>>>struct
>>>>>>>>>>dpll_pin *pin,
>>>>>>>>>> 	if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY, sizeof(freq),
>>>>>>>>>>&freq,
>>>>>>>>>> 			  DPLL_A_PIN_PAD))
>>>>>>>>>> 		return -EMSGSIZE;
>>>>>>>>>>-	for (fs = 0; fs < pin->prop->freq_supported_num; fs++) {
>>>>>>>>>>+	for (fs = 0; fs < pin->prop.freq_supported_num; fs++) {
>>>>>>>>>> 		nest = nla_nest_start(msg,
>>>>>>>>>>DPLL_A_PIN_FREQUENCY_SUPPORTED);
>>>>>>>>>> 		if (!nest)
>>>>>>>>>> 			return -EMSGSIZE;
>>>>>>>>>>-		freq = pin->prop->freq_supported[fs].min;
>>>>>>>>>>+		freq = pin->prop.freq_supported[fs].min;
>>>>>>>>>> 		if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY_MIN,
>>>>>>>>>>sizeof(freq),
>>>>>>>>>> 				  &freq, DPLL_A_PIN_PAD)) {
>>>>>>>>>> 			nla_nest_cancel(msg, nest);
>>>>>>>>>> 			return -EMSGSIZE;
>>>>>>>>>> 		}
>>>>>>>>>>-		freq = pin->prop->freq_supported[fs].max;
>>>>>>>>>>+		freq = pin->prop.freq_supported[fs].max;
>>>>>>>>>> 		if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY_MAX,
>>>>>>>>>>sizeof(freq),
>>>>>>>>>> 				  &freq, DPLL_A_PIN_PAD)) {
>>>>>>>>>> 			nla_nest_cancel(msg, nest);
>>>>>>>>>>@@ -304,9 +304,9 @@ static bool dpll_pin_is_freq_supported(struct
>>>>>>>>>>dpll_pin
>>>>>>>>>>*pin, u32 freq)
>>>>>>>>>> {
>>>>>>>>>> 	int fs;
>>>>>>>>>>
>>>>>>>>>>-	for (fs = 0; fs < pin->prop->freq_supported_num; fs++)
>>>>>>>>>>-		if (freq >= pin->prop->freq_supported[fs].min &&
>>>>>>>>>>-		    freq <= pin->prop->freq_supported[fs].max)
>>>>>>>>>>+	for (fs = 0; fs < pin->prop.freq_supported_num; fs++)
>>>>>>>>>>+		if (freq >= pin->prop.freq_supported[fs].min &&
>>>>>>>>>>+		    freq <= pin->prop.freq_supported[fs].max)
>>>>>>>>>> 			return true;
>>>>>>>>>> 	return false;
>>>>>>>>>> }
>>>>>>>>>>@@ -403,7 +403,7 @@ static int
>>>>>>>>>> dpll_cmd_pin_get_one(struct sk_buff *msg, struct dpll_pin *pin,
>>>>>>>>>> 		     struct netlink_ext_ack *extack)
>>>>>>>>>> {
>>>>>>>>>>-	const struct dpll_pin_properties *prop = pin->prop;
>>>>>>>>>>+	const struct dpll_pin_properties *prop = &pin->prop;
>>>>>>>>>> 	struct dpll_pin_ref *ref;
>>>>>>>>>> 	int ret;
>>>>>>>>>>
>>>>>>>>>>@@ -696,7 +696,7 @@ dpll_pin_on_pin_state_set(struct dpll_pin
>>>>>>>>>>*pin,
>>>>>>>>>>u32
>>>>>>>>>>parent_idx,
>>>>>>>>>> 	int ret;
>>>>>>>>>>
>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE &
>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "state changing is not allowed");
>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>> 	}
>>>>>>>>>>@@ -732,7 +732,7 @@ dpll_pin_state_set(struct dpll_device *dpll,
>>>>>>>>>>struct
>>>>>>>>>>dpll_pin *pin,
>>>>>>>>>> 	int ret;
>>>>>>>>>>
>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE &
>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "state changing is not allowed");
>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>> 	}
>>>>>>>>>>@@ -759,7 +759,7 @@ dpll_pin_prio_set(struct dpll_device *dpll,
>>>>>>>>>>struct
>>>>>>>>>>dpll_pin *pin,
>>>>>>>>>> 	int ret;
>>>>>>>>>>
>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_PRIORITY_CAN_CHANGE &
>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "prio changing is not allowed");
>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>> 	}
>>>>>>>>>>@@ -787,7 +787,7 @@ dpll_pin_direction_set(struct dpll_pin *pin,
>>>>>>>>>>struct
>>>>>>>>>>dpll_device *dpll,
>>>>>>>>>> 	int ret;
>>>>>>>>>>
>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_DIRECTION_CAN_CHANGE &
>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "direction changing is not
>>>>>>>>>>allowed");
>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>> 	}
>>>>>>>>>>@@ -817,8 +817,8 @@ dpll_pin_phase_adj_set(struct dpll_pin *pin,
>>>>>>>>>>struct
>>>>>>>>>>nlattr *phase_adj_attr,
>>>>>>>>>> 	int ret;
>>>>>>>>>>
>>>>>>>>>> 	phase_adj = nla_get_s32(phase_adj_attr);
>>>>>>>>>>-	if (phase_adj > pin->prop->phase_range.max ||
>>>>>>>>>>-	    phase_adj < pin->prop->phase_range.min) {
>>>>>>>>>>+	if (phase_adj > pin->prop.phase_range.max ||
>>>>>>>>>>+	    phase_adj < pin->prop.phase_range.min) {
>>>>>>>>>> 		NL_SET_ERR_MSG_ATTR(extack, phase_adj_attr,
>>>>>>>>>> 				    "phase adjust value not supported");
>>>>>>>>>> 		return -EINVAL;
>>>>>>>>>>@@ -999,7 +999,7 @@ dpll_pin_find(u64 clock_id, struct nlattr
>>>>>>>>>>*mod_name_attr,
>>>>>>>>>> 	unsigned long i;
>>>>>>>>>>
>>>>>>>>>> 	xa_for_each_marked(&dpll_pin_xa, i, pin, DPLL_REGISTERED) {
>>>>>>>>>>-		prop = pin->prop;
>>>>>>>>>>+		prop = &pin->prop;
>>>>>>>>>> 		cid_match = clock_id ? pin->clock_id == clock_id : true;
>>>>>>>>>> 		mod_match = mod_name_attr && module_name(pin->module) ?
>>>>>>>>>> 			!nla_strcmp(mod_name_attr,
>>>>>>>>>>--
>>>>>>>>>>2.38.1
>>>>>>>>>>
>>>>>>>>
>>>>

^ permalink raw reply

* [PATCH v2] Fixes the null pointer deferences in nsim_bpf
From: Dipendra Khadka @ 2023-11-10 11:18 UTC (permalink / raw)
  To: kuba, davem, edumazet, pabeni
  Cc: Dipendra Khadka, netdev, linux-kernel, linux-kernel-mentees,
	syzbot+44c2416196b7c607f226
In-Reply-To: <20231110084425.2123-1-kdipendra88@gmail.com>

Syzkaller found a null pointer dereference in nsim_bpf
originating from the lack of a null check for state.

This patch fixes the issue by adding a check for state
in two functions nsim_prog_set_loaded() and nsim_setup_prog_hw_checks()

Reported-by: syzbot+44c2416196b7c607f226@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com./bug?extid=44c2416196b7c607f226
Fixes: 31d3ad832948 ("netdevsim: add bpf offload support")
Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com>
---
 drivers/net/netdevsim/bpf.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c
index f60eb97e3a62..5d755da3c736 100644
--- a/drivers/net/netdevsim/bpf.c
+++ b/drivers/net/netdevsim/bpf.c
@@ -97,7 +97,8 @@ static void nsim_prog_set_loaded(struct bpf_prog *prog, bool loaded)
 		return;
 
 	state = prog->aux->offload->dev_priv;
-	state->is_loaded = loaded;
+	if (state)
+		state->is_loaded = loaded;
 }
 
 static int
@@ -317,9 +318,11 @@ nsim_setup_prog_hw_checks(struct netdevsim *ns, struct netdev_bpf *bpf)
 	}
 
 	state = bpf->prog->aux->offload->dev_priv;
-	if (WARN_ON(strcmp(state->state, "xlated"))) {
-		NSIM_EA(bpf->extack, "offloading program in bad state");
-		return -EINVAL;
+	if (state) {
+		if (WARN_ON(strcmp(state->state, "xlated"))) {
+			NSIM_EA(bpf->extack, "offloading program in bad state");
+			return -EINVAL;
+		}
 	}
 	return 0;
 }
-- 
2.34.1


^ permalink raw reply related

* RE: [PATCH net 3/3] dpll: fix register pin with unregistered parent pin
From: Kubalewski, Arkadiusz @ 2023-11-10 11:19 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Vadim Fedorenko, netdev@vger.kernel.org, Michalik, Michal,
	Olech, Milena, pabeni@redhat.com, kuba@kernel.org
In-Reply-To: <ZU4A1FG0+JgVz3HF@nanopsycho>

>From: Jiri Pirko <jiri@resnulli.us>
>Sent: Friday, November 10, 2023 11:07 AM
>
>Fri, Nov 10, 2023 at 09:50:34AM CET, arkadiusz.kubalewski@intel.com wrote:
>>>From: Jiri Pirko <jiri@resnulli.us>
>>>Sent: Friday, November 10, 2023 7:48 AM
>>>
>>>Fri, Nov 10, 2023 at 12:21:11AM CET, arkadiusz.kubalewski@intel.com
>>>wrote:
>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>Sent: Thursday, November 9, 2023 7:04 PM
>>>>>
>>>>>Thu, Nov 09, 2023 at 05:02:48PM CET, arkadiusz.kubalewski@intel.com
>>>>>wrote:
>>>>>>>From: Vadim Fedorenko <vadim.fedorenko@linux.dev>
>>>>>>>Sent: Thursday, November 9, 2023 11:56 AM
>>>>>>>To: Kubalewski, Arkadiusz <arkadiusz.kubalewski@intel.com>; Jiri
>>>>>>>Pirko
>>>>>>>
>>>>>>>On 09/11/2023 09:59, Kubalewski, Arkadiusz wrote:
>>>>>>>>> From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>>> Sent: Wednesday, November 8, 2023 4:08 PM
>>>>>>>>>
>>>>>>>>> Wed, Nov 08, 2023 at 11:32:26AM CET,
>>>>>>>>>arkadiusz.kubalewski@intel.com
>>>>>>>>> wrote:
>>>>>>>>>> In case of multiple kernel module instances using the same dpll
>>>>>>>>>>device:
>>>>>>>>>> if only one registers dpll device, then only that one can
>>>>>>>>>>register
>>>>>>>>>
>>>>>>>>> They why you don't register in multiple instances? See mlx5 for a
>>>>>>>>> reference.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Every registration requires ops, but for our case only PF0 is able
>>>>>>>> to
>>>>>>>> control dpll pins and device, thus only this can provide ops.
>>>>>>>> Basically without PF0, dpll is not able to be controlled, as well
>>>>>>>> as directly connected pins.
>>>>>>>>
>>>>>>>But why do you need other pins then, if FP0 doesn't exist?
>>>>>>>
>>>>>>
>>>>>>In general we don't need them at that point, but this is a corner
>>>>>>case,
>>>>>>where users for some reason decided to unbind PF 0, and I treat this
>>>>>>state
>>>>>>as temporary, where dpll/pins controllability is temporarily broken.
>>>>>
>>>>>So resolve this broken situation internally in the driver, registering
>>>>>things only in case PF0 is present. Some simple notification infra
>>>>>would
>>>>>do. Don't drag this into the subsystem internals.
>>>>>
>>>>
>>>>Thanks for your feedback, but this is already wrong advice.
>>>>
>>>>Our HW/FW is designed in different way than yours, it doesn't mean it is
>>>>wrong.
>>>>As you might recall from our sync meetings, the dpll subsystem is to
>>>>unify
>>>>approaches and reduce the code in the drivers, where your advice is
>>>>exactly
>>>
>>>No. Your driver knows when what objects are valid or not. Of a pin of
>>>PF1 is not valid because the "master" PF0 is gone, it is responsibility
>>>of your driver to resolve that. Don't bring this internal dependencies
>>>to the dpll core please, does not make any sense to do so. Thanks!
>>>
>>
>>No, a driver doesn't know it, those are separated instances, and you
>>already
>>suggested to implement special notification bus in the driver.
>>This is not needed and prone for another errors. The dpll subsystem is
>>here to
>>make driver life easier.
>
>See the other thread for my reply.
>

Ok, will do.

Thank you!
Arkadiusz

>>
>>Thank you!
>>Arkadiusz
>>
>>>
>>>>opposite, suggested fix would require to implement extra synchronization
>>>>of the
>>>>dpll and pin registration state between driver instances, most probably
>>>>with
>>>>use of additional modules like aux-bus or something similar, which was
>>>>from the
>>>>very beginning something we tried to avoid.
>>>>Only ice uses the infrastructure of muxed pins, and this is broken as it
>>>>doesn't allow unbind the driver which have registered dpll and pins
>>>>without
>>>>crashing the kernel, so a fix is required in dpll subsystem, not in the
>>>>driver.
>>>>
>>>>Thank you!
>>>>Arkadiusz
>>>>
>>>>>
>>>>>>
>>>>>>The dpll at that point is not registered, all the direct pins are also
>>>>>>not registered, thus not available to the users.
>>>>>>
>>>>>>When I do dump at that point there are still 3 pins present, one for
>>>>>>each
>>>>>>PF, although they are all zombies - no parents as their parent pins
>>>>>>are
>>>>>>not
>>>>>>registered (as the other patch [1/3] prevents dump of pin parent if
>>>>>>the
>>>>>>parent is not registered). Maybe we can remove the REGISTERED mark for
>>>>>>all
>>>>>>the muxed pins, if all their parents have been unregistered, so they
>>>>>>won't
>>>>>>be visible to the user at all. Will try to POC that.
>>>>>>
>>>>>>>>>
>>>>>>>>>> directly connected pins with a dpll device. If unregistered
>>>>>>>>>>parent
>>>>>>>>>> determines if the muxed pin can be register with it or not, it
>>>>>>>>>>forces
>>>>>>>>>> serialized driver load order - first the driver instance which
>>>>>>>>>> registers the direct pins needs to be loaded, then the other
>>>>>>>>>> instances
>>>>>>>>>> could register muxed type pins.
>>>>>>>>>>
>>>>>>>>>> Allow registration of a pin with a parent even if the parent was
>>>>>>>>>>not
>>>>>>>>>> yet registered, thus allow ability for unserialized driver
>>>>>>>>>>instance
>>>>>>>>>
>>>>>>>>> Weird.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Yeah, this is issue only for MUX/parent pin part, couldn't find
>>>>>>>>better
>>>>>>>> way, but it doesn't seem to break things around..
>>>>>>>>
>>>>>>>
>>>>>>>I just wonder how do you see the registration procedure? How can
>>>>>>>parent
>>>>>>>pin exist if it's not registered? I believe you cannot get it through
>>>>>>>DPLL API, then the only possible way is to create it within the same
>>>>>>>driver code, which can be simply re-arranged. Am I wrong here?
>>>>>>>
>>>>>>
>>>>>>By "parent exist" I mean the parent pin exist in the dpll subsystem
>>>>>>(allocated on pins xa), but it doesn't mean it is available to the
>>>>>>users,
>>>>>>as it might not be registered with a dpll device.
>>>>>>
>>>>>>We have this 2 step init approach:
>>>>>>1. dpll_pin_get(..) -> allocate new pin or increase reference if exist
>>>>>>2.1. dpll_pin_register(..) -> register with a dpll device
>>>>>>2.2. dpll_pin_on_pin_register -> register with a parent pin
>>>>>>
>>>>>>Basically:
>>>>>>- PF 0 does 1 & 2.1 for all the direct inputs, and steps: 1 & 2.2 for
>>>>>>its
>>>>>>  recovery clock pin,
>>>>>>- other PF's only do step 1 for the direct input pins (as they must
>>>>>>get
>>>>>>  reference to those in order to register recovery clock pin with
>>>>>>them),
>>>>>>  and steps: 1 & 2.2 for their recovery clock pin.
>>>>>>
>>>>>>
>>>>>>Thank you!
>>>>>>Arkadiusz
>>>>>>
>>>>>>>> Thank you!
>>>>>>>> Arkadiusz
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> load order.
>>>>>>>>>> Do not WARN_ON notification for unregistered pin, which can be
>>>>>>>>>> invoked
>>>>>>>>>> for described case, instead just return error.
>>>>>>>>>>
>>>>>>>>>> Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base
>>>>>>>>>>functions")
>>>>>>>>>> Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base
>>>>>>>>>> functions")
>>>>>>>>>> Signed-off-by: Arkadiusz Kubalewski
>>>>>>>>>><arkadiusz.kubalewski@intel.com>
>>>>>>>>>> ---
>>>>>>>>>> drivers/dpll/dpll_core.c    | 4 ----
>>>>>>>>>> drivers/dpll/dpll_netlink.c | 2 +-
>>>>>>>>>> 2 files changed, 1 insertion(+), 5 deletions(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c
>>>>>>>>>> index
>>>>>>>>>> 4077b562ba3b..ae884b92d68c 100644
>>>>>>>>>> --- a/drivers/dpll/dpll_core.c
>>>>>>>>>> +++ b/drivers/dpll/dpll_core.c
>>>>>>>>>> @@ -28,8 +28,6 @@ static u32 dpll_xa_id;
>>>>>>>>>> 	WARN_ON_ONCE(!xa_get_mark(&dpll_device_xa, (d)->id,
>>>>>>>>>> DPLL_REGISTERED))
>>>>>>>>>> #define ASSERT_DPLL_NOT_REGISTERED(d)	\
>>>>>>>>>> 	WARN_ON_ONCE(xa_get_mark(&dpll_device_xa, (d)->id,
>>>>>>>>>> DPLL_REGISTERED))
>>>>>>>>>> -#define ASSERT_PIN_REGISTERED(p)	\
>>>>>>>>>> -	WARN_ON_ONCE(!xa_get_mark(&dpll_pin_xa, (p)->id,
>>>>>>>>>> DPLL_REGISTERED))
>>>>>>>>>>
>>>>>>>>>> struct dpll_device_registration {
>>>>>>>>>> 	struct list_head list;
>>>>>>>>>> @@ -641,8 +639,6 @@ int dpll_pin_on_pin_register(struct dpll_pin
>>>>>>>>>> *parent,
>>>>>>>>>> struct dpll_pin *pin,
>>>>>>>>>> 	    WARN_ON(!ops->state_on_pin_get) ||
>>>>>>>>>> 	    WARN_ON(!ops->direction_get))
>>>>>>>>>> 		return -EINVAL;
>>>>>>>>>> -	if (ASSERT_PIN_REGISTERED(parent))
>>>>>>>>>> -		return -EINVAL;
>>>>>>>>>>
>>>>>>>>>> 	mutex_lock(&dpll_lock);
>>>>>>>>>> 	ret = dpll_xa_ref_pin_add(&pin->parent_refs, parent, ops,
>>>>>>>>>> priv); diff
>>>>>>>>>> --git a/drivers/dpll/dpll_netlink.c b/drivers/dpll/dpll_netlink.c
>>>>>>>>>> index
>>>>>>>>>> 963bbbbe6660..ff430f43304f 100644
>>>>>>>>>> --- a/drivers/dpll/dpll_netlink.c
>>>>>>>>>> +++ b/drivers/dpll/dpll_netlink.c
>>>>>>>>>> @@ -558,7 +558,7 @@ dpll_pin_event_send(enum dpll_cmd event,
>>>>>>>>>>struct
>>>>>>>>>> dpll_pin *pin)
>>>>>>>>>> 	int ret = -ENOMEM;
>>>>>>>>>> 	void *hdr;
>>>>>>>>>>
>>>>>>>>>> -	if (WARN_ON(!xa_get_mark(&dpll_pin_xa, pin->id,
>>>>>>>>>> DPLL_REGISTERED)))
>>>>>>>>>> +	if (!xa_get_mark(&dpll_pin_xa, pin->id, DPLL_REGISTERED))
>>>>>>>>>> 		return -ENODEV;
>>>>>>>>>>
>>>>>>>>>> 	msg = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
>>>>>>>>>> --
>>>>>>>>>> 2.38.1
>>>>>>>>>>
>>>>>>>
>>>>>>


^ permalink raw reply

* RE: [PATCH net 0/3] dpll: fix unordered unbind/bind registerer issues
From: Kubalewski, Arkadiusz @ 2023-11-10 11:22 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Vadim Fedorenko, netdev@vger.kernel.org, Michalik, Michal,
	Olech, Milena, pabeni@redhat.com, kuba@kernel.org
In-Reply-To: <ZU4BVdPvAwKer+3v@nanopsycho>

>From: Jiri Pirko <jiri@resnulli.us>
>Sent: Friday, November 10, 2023 11:09 AM
>
>Fri, Nov 10, 2023 at 10:06:59AM CET, arkadiusz.kubalewski@intel.com wrote:
>>>From: Jiri Pirko <jiri@resnulli.us>
>>>Sent: Friday, November 10, 2023 7:49 AM
>>>
>>>Fri, Nov 10, 2023 at 12:35:43AM CET, arkadiusz.kubalewski@intel.com
>>>wrote:
>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>Sent: Thursday, November 9, 2023 7:07 PM
>>>>>
>>>>>Thu, Nov 09, 2023 at 06:20:14PM CET, arkadiusz.kubalewski@intel.com
>>>>>wrote:
>>>>>>>From: Vadim Fedorenko <vadim.fedorenko@linux.dev>
>>>>>>>Sent: Thursday, November 9, 2023 11:51 AM
>>>>>>>
>>>>>>>On 08/11/2023 10:32, Arkadiusz Kubalewski wrote:
>>>>>>>> Fix issues when performing unordered unbind/bind of a kernel
>>>>>>>>modules
>>>>>>>> which are using a dpll device with DPLL_PIN_TYPE_MUX pins.
>>>>>>>> Currently only serialized bind/unbind of such use case works, fix
>>>>>>>> the issues and allow for unserialized kernel module bind order.
>>>>>>>>
>>>>>>>> The issues are observed on the ice driver, i.e.,
>>>>>>>>
>>>>>>>> $ echo 0000:af:00.0 > /sys/bus/pci/drivers/ice/unbind
>>>>>>>> $ echo 0000:af:00.1 > /sys/bus/pci/drivers/ice/unbind
>>>>>>>>
>>>>>>>> results in:
>>>>>>>>
>>>>>>>> ice 0000:af:00.0: Removed PTP clock
>>>>>>>> BUG: kernel NULL pointer dereference, address: 0000000000000010
>>>>>>>> PF: supervisor read access in kernel mode
>>>>>>>> PF: error_code(0x0000) - not-present page
>>>>>>>> PGD 0 P4D 0
>>>>>>>> Oops: 0000 [#1] PREEMPT SMP PTI
>>>>>>>> CPU: 7 PID: 71848 Comm: bash Kdump: loaded Not tainted 6.6.0-
>>>>>>>>rc5_next-
>>>>>>>>queue_19th-Oct-2023-01625-g039e5d15e451 #1
>>>>>>>> Hardware name: Intel Corporation S2600STB/S2600STB, BIOS
>>>>>>>>SE5C620.86B.02.01.0008.031920191559 03/19/2019
>>>>>>>> RIP: 0010:ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
>>>>>>>> Code: 41 57 4d 89 cf 41 56 41 55 4d 89 c5 41 54 55 48 89 f5 53 4c
>>>>>>>>8b
>>>>>>>>66
>>>>>>>>08 48 89 cb 4d 8d b4 24 f0 49 00 00 4c 89 f7 e8 71 ec 1f c5 <0f> b6
>>>>>>>>5b
>>>>>>>>10
>>>>>>>>41 0f b6 84 24 30 4b 00 00 29 c3 41 0f b6 84 24 28 4b
>>>>>>>> RSP: 0018:ffffc902b179fb60 EFLAGS: 00010246
>>>>>>>> RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
>>>>>>>> RDX: ffff8882c1398000 RSI: ffff888c7435cc60 RDI: ffff888c7435cb90
>>>>>>>> RBP: ffff888c7435cc60 R08: ffffc902b179fbb0 R09: 0000000000000000
>>>>>>>> R10: ffff888ef1fc8050 R11: fffffffffff82700 R12: ffff888c743581a0
>>>>>>>> R13: ffffc902b179fbb0 R14: ffff888c7435cb90 R15: 0000000000000000
>>>>>>>> FS:  00007fdc7dae0740(0000) GS:ffff888c105c0000(0000)
>>>>>>>>knlGS:0000000000000000
>>>>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>>> CR2: 0000000000000010 CR3: 0000000132c24002 CR4: 00000000007706e0
>>>>>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>>> PKRU: 55555554
>>>>>>>> Call Trace:
>>>>>>>>   <TASK>
>>>>>>>>   ? __die+0x20/0x70
>>>>>>>>   ? page_fault_oops+0x76/0x170
>>>>>>>>   ? exc_page_fault+0x65/0x150
>>>>>>>>   ? asm_exc_page_fault+0x22/0x30
>>>>>>>>   ? ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
>>>>>>>>   ? __pfx_ice_dpll_rclk_state_on_pin_get+0x10/0x10 [ice]
>>>>>>>>   dpll_msg_add_pin_parents+0x142/0x1d0
>>>>>>>>   dpll_pin_event_send+0x7d/0x150
>>>>>>>>   dpll_pin_on_pin_unregister+0x3f/0x100
>>>>>>>>   ice_dpll_deinit_pins+0xa1/0x230 [ice]
>>>>>>>>   ice_dpll_deinit+0x29/0xe0 [ice]
>>>>>>>>   ice_remove+0xcd/0x200 [ice]
>>>>>>>>   pci_device_remove+0x33/0xa0
>>>>>>>>   device_release_driver_internal+0x193/0x200
>>>>>>>>   unbind_store+0x9d/0xb0
>>>>>>>>   kernfs_fop_write_iter+0x128/0x1c0
>>>>>>>>   vfs_write+0x2bb/0x3e0
>>>>>>>>   ksys_write+0x5f/0xe0
>>>>>>>>   do_syscall_64+0x59/0x90
>>>>>>>>   ? filp_close+0x1b/0x30
>>>>>>>>   ? do_dup2+0x7d/0xd0
>>>>>>>>   ? syscall_exit_work+0x103/0x130
>>>>>>>>   ? syscall_exit_to_user_mode+0x22/0x40
>>>>>>>>   ? do_syscall_64+0x69/0x90
>>>>>>>>   ? syscall_exit_work+0x103/0x130
>>>>>>>>   ? syscall_exit_to_user_mode+0x22/0x40
>>>>>>>>   ? do_syscall_64+0x69/0x90
>>>>>>>>   entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>>>>>>>> RIP: 0033:0x7fdc7d93eb97
>>>>>>>> Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3
>>>>>>>>0f
>>>>>>>>1e
>>>>>>>>fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d
>>>>>>>>00
>>>>>>>>f0
>>>>>>>>ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>>>>>>> RSP: 002b:00007fff2aa91028 EFLAGS: 00000246 ORIG_RAX:
>>>>>>>>0000000000000001
>>>>>>>> RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fdc7d93eb97
>>>>>>>> RDX: 000000000000000d RSI: 00005644814ec9b0 RDI: 0000000000000001
>>>>>>>> RBP: 00005644814ec9b0 R08: 0000000000000000 R09: 00007fdc7d9b14e0
>>>>>>>> R10: 00007fdc7d9b13e0 R11: 0000000000000246 R12: 000000000000000d
>>>>>>>> R13: 00007fdc7d9fb780 R14: 000000000000000d R15: 00007fdc7d9f69e0
>>>>>>>>   </TASK>
>>>>>>>> Modules linked in: uinput vfio_pci vfio_pci_core vfio_iommu_type1
>>>>>>>>vfio
>>>>>>>>irqbypass ixgbevf snd_seq_dummy snd_hrtimer snd_seq snd_timer
>>>>>>>>snd_seq_device snd soundcore overlay qrtr rfkill vfat fat xfs
>>>>>>>>libcrc32c
>>>>>>>>rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod
>>>>>>>>target_core_mod
>>>>>>>>ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm
>>>>>>>>intel_rapl_msr
>>>>>>>>intel_rapl_common intel_uncore_frequency
>>>>>>>>intel_uncore_frequency_common
>>>>>>>>isst_if_common skx_edac nfit libnvdimm ipmi_ssif
>>>>>>>>x86_pkg_temp_thermal
>>>>>>>>intel_powerclamp coretemp irdma rapl intel_cstate ib_uverbs iTCO_wdt
>>>>>>>>iTCO_vendor_support acpi_ipmi intel_uncore mei_me ipmi_si pcspkr
>>>>>>>>i2c_i801
>>>>>>>>ib_core mei ipmi_devintf intel_pch_thermal ioatdma i2c_smbus
>>>>>>>>ipmi_msghandler lpc_ich joydev acpi_power_meter acpi_pad ext4
>>>>>>>>mbcache
>>>>>>>>jbd2
>>>>>>>>sd_mod t10_pi sg ast i2c_algo_bit drm_shmem_helper drm_kms_helper
>>>>>>>>ice
>>>>>>>>crct10dif_pclmul ixgbe crc32_pclmul drm crc32c_intel ahci i40e
>>>>>>>>libahci
>>>>>>>>ghash_clmulni_intel libata mdio dca gnss wmi fuse [last unloaded:
>>>>>>>>iavf]
>>>>>>>> CR2: 0000000000000010
>>>>>>>>
>>>>>>>> Arkadiusz Kubalewski (3):
>>>>>>>>    dpll: fix pin dump crash after module unbind
>>>>>>>>    dpll: fix pin dump crash for rebound module
>>>>>>>>    dpll: fix register pin with unregistered parent pin
>>>>>>>>
>>>>>>>>   drivers/dpll/dpll_core.c    |  8 ++------
>>>>>>>>   drivers/dpll/dpll_core.h    |  4 ++--
>>>>>>>>   drivers/dpll/dpll_netlink.c | 37 ++++++++++++++++++++++----------
>>>>>>>>--
>>>>>>>>--
>>>>>>>>-
>>>>>>>>   3 files changed, 26 insertions(+), 23 deletions(-)
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>I still don't get how can we end up with unregistered pin. And
>>>>>>>shouldn't
>>>>>>>drivers do unregister of dpll/pin during release procedure? I thought
>>>>>>>it
>>>>>>>was kind of agreement we reached while developing the subsystem.
>>>>>>>
>>>>>>
>>>>>>It's definitely not about ending up with unregistered pins.
>>>>>>
>>>>>>Usually the driver is loaded for PF0, PF1, PF2, PF3 and unloaded in
>>>>>>opposite
>>>>>>order: PF3, PF2, PF1, PF0. And this is working without any issues.
>>>>>
>>>>>Please fix this in the driver.
>>>>>
>>>>
>>>>Thanks for your feedback, but this is already wrong advice.
>>>>
>>>>Our HW/FW is designed in different way than yours, it doesn't mean it is
>>>>wrong.
>>>>As you might recall from our sync meetings, the dpll subsystem is to
>>>>unify
>>>>approaches and reduce the code in the drivers, where your advice is
>>>>exactly
>>>>opposite, suggested fix would require to implement extra synchronization
>>>>of the
>>>>dpll and pin registration state between driver instances, most probably
>>>>with
>>>>use of additional modules like aux-bus or something similar, which was
>>>>from the
>>>>very beginning something we tried to avoid.
>>>>Only ice uses the infrastructure of muxed pins, and this is broken as it
>>>>doesn't allow unbind the driver which have registered dpll and pins
>>>>without
>>>>crashing the kernel, so a fix is required in dpll subsystem, not in the
>>>>driver.
>>>
>>>I replied in the other patch thread.
>>>
>>
>>Yes, so did I.
>>But what is the reason you have moved the discussion from the other thread
>>into this one?
>
>I didn't, not sure why you say so. I just wanted to make sure you
>follow.
>

Funny thing, you said you did not, and just after explained why.

Thank you!
Arkadiusz

>>
>>Thank you!
>>Arkadiusz
>>
>>>
>>>>
>>>>Thank you!
>>>>Arkadiusz
>>>>
>>>>>
>>>>>>
>>>>>>Above crash is caused because of unordered driver unload, where dpll
>>>>>>subsystem
>>>>>>tries to notify muxed pin was deleted, but at that time the parent is
>>>>>>already
>>>>>>gone, thus data points to memory which is no longer available, thus
>>>>>>crash
>>>>>>happens when trying to dump pin parents.
>>>>>>
>>>>>>This series fixes all issues I could find connected to the situation
>>>>>>where
>>>>>>muxed-pins are trying to access their parents, when parent registerer
>>>>>>was
>>>>>>removed
>>>>>>in the meantime.
>>>>>>
>>>>>>Thank you!
>>>>>>Arkadiusz

^ permalink raw reply

* Re: [PATCH net 2/3] dpll: fix pin dump crash for rebound module
From: Jiri Pirko @ 2023-11-10 11:44 UTC (permalink / raw)
  To: Kubalewski, Arkadiusz
  Cc: netdev@vger.kernel.org, vadim.fedorenko@linux.dev,
	Michalik, Michal, Olech, Milena, pabeni@redhat.com,
	kuba@kernel.org
In-Reply-To: <DM6PR11MB46578574D6DA2F311FB97CF09BAEA@DM6PR11MB4657.namprd11.prod.outlook.com>

Fri, Nov 10, 2023 at 12:18:51PM CET, arkadiusz.kubalewski@intel.com wrote:
>>From: Jiri Pirko <jiri@resnulli.us>
>>Sent: Friday, November 10, 2023 11:06 AM
>>
>>Fri, Nov 10, 2023 at 10:01:50AM CET, arkadiusz.kubalewski@intel.com wrote:
>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>Sent: Friday, November 10, 2023 7:46 AM
>>>>
>>>>Fri, Nov 10, 2023 at 12:32:21AM CET, arkadiusz.kubalewski@intel.com
>>>>wrote:
>>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>>Sent: Thursday, November 9, 2023 7:06 PM
>>>>>>
>>>>>>Thu, Nov 09, 2023 at 05:30:20PM CET, arkadiusz.kubalewski@intel.com
>>>>>>wrote:
>>>>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>>Sent: Thursday, November 9, 2023 2:19 PM
>>>>>>>>
>>>>>>>>Thu, Nov 09, 2023 at 01:20:48PM CET, arkadiusz.kubalewski@intel.com
>>>>>>>>wrote:
>>>>>>>>>>From: Jiri Pirko <jiri@resnulli.us>
>>>>>>>>>>Sent: Wednesday, November 8, 2023 3:30 PM
>>>>>>>>>>
>>>>>>>>>>Wed, Nov 08, 2023 at 11:32:25AM CET, arkadiusz.kubalewski@intel.com
>>>>>>>>>>wrote:
>>>>>>>>>>>When a kernel module is unbound but the pin resources were not
>>>>>>>>>>>entirely
>>>>>>>>>>>freed (other kernel module instance have had kept the reference to
>>>>>>>>>>>that
>>>>>>>>>>>pin), and kernel module is again bound, the pin properties would
>>>>>>>>>>>not
>>>>>>>>>>>be
>>>>>>>>>>>updated (the properties are only assigned when memory for the pin
>>>>>>>>>>>is
>>>>>>>>>>>allocated), prop pointer still points to the kernel module memory
>>>>>>>>>>>of
>>>>>>>>>>>the kernel module which was deallocated on the unbind.
>>>>>>>>>>>
>>>>>>>>>>>If the pin dump is invoked in this state, the result is a kernel
>>>>>>>>>>>crash.
>>>>>>>>>>>Prevent the crash by storing persistent pin properties in dpll
>>>>>>>>>>>subsystem,
>>>>>>>>>>>copy the content from the kernel module when pin is allocated,
>>>>>>>>>>>instead
>>>>>>>>>>>of
>>>>>>>>>>>using memory of the kernel module.
>>>>>>>>>>>
>>>>>>>>>>>Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base
>>>>>>>>>>>functions")
>>>>>>>>>>>Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base
>>>>>>>>>>>functions")
>>>>>>>>>>>Signed-off-by: Arkadiusz Kubalewski
>>>>>>>>>>><arkadiusz.kubalewski@intel.com>
>>>>>>>>>>>---
>>>>>>>>>>> drivers/dpll/dpll_core.c    |  4 ++--
>>>>>>>>>>> drivers/dpll/dpll_core.h    |  4 ++--
>>>>>>>>>>> drivers/dpll/dpll_netlink.c | 28 ++++++++++++++--------------
>>>>>>>>>>> 3 files changed, 18 insertions(+), 18 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>>diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c
>>>>>>>>>>>index 3568149b9562..4077b562ba3b 100644
>>>>>>>>>>>--- a/drivers/dpll/dpll_core.c
>>>>>>>>>>>+++ b/drivers/dpll/dpll_core.c
>>>>>>>>>>>@@ -442,7 +442,7 @@ dpll_pin_alloc(u64 clock_id, u32 pin_idx,
>>>>>>>>>>>struct
>>>>>>>>>>>module *module,
>>>>>>>>>>> 		ret = -EINVAL;
>>>>>>>>>>> 		goto err;
>>>>>>>>>>> 	}
>>>>>>>>>>>-	pin->prop = prop;
>>>>>>>>>>>+	memcpy(&pin->prop, prop, sizeof(pin->prop));
>>>>>>>>>>
>>>>>>>>>>Odd, you don't care about the pointer within this structure?
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Well, true. Need a fix.
>>>>>>>>>Wondering if copying idea is better than just assigning prop pointer
>>>>>>>>>on
>>>>>>>>>each call to dpll_pin_get(..) function (when pin already exists)?
>>>>>>>>
>>>>>>>>Not sure what do you mean. Examples please.
>>>>>>>>
>>>>>>>
>>>>>>>Sure,
>>>>>>>
>>>>>>>Basically this change:
>>>>>>>
>>>>>>>diff --git a/drivers/dpll/dpll_core.c b/drivers/dpll/dpll_core.c
>>>>>>>index ae884b92d68c..06b72d5877c3 100644
>>>>>>>--- a/drivers/dpll/dpll_core.c
>>>>>>>+++ b/drivers/dpll/dpll_core.c
>>>>>>>@@ -483,6 +483,7 @@ dpll_pin_get(u64 clock_id, u32 pin_idx, struct
>>>>>>>module
>>>>>>>*module,
>>>>>>>                    pos->pin_idx == pin_idx &&
>>>>>>>                    pos->module == module) {
>>>>>>>                        ret = pos;
>>>>>>>+                       pos->prop = prop;
>>>>>>>                        refcount_inc(&ret->refcount);
>>>>>>>                        break;
>>>>>>>                }
>>>>>>>
>>>>>>>would replace whole of this patch changes, although seems a bit hacky.
>>>>>>
>>>>>>Or event better, as I suggested in the other patch reply, resolve this
>>>>>>internally in the driver registering things only when they are valid.
>>>>>>Much better then to hack anything in dpll core.
>>>>>>
>>>>>
>>>>>This approach seemed to me hacky, that is why started with coping the
>>>>>data.
>>>>>It is not about registering, rather about unregistering on driver
>>>>>unbind, which brakes things, and currently cannot be recovered in
>>>>>described case.
>>>>
>>>>Sure it can. PF0 unbind-> internal notification-> PF1 unregisters all
>>>>related object. Very clean and simple.
>>>>
>>>
>>>What you are suggesting is:
>>>- special purpose bus in the driver,
>>
>>No, it is a simple notificator. Very common infra over the whole
>>kernel code.
>>
>
>No difference if this is simple or not, it is special purpose.
>
>>
>>>- dpll-related,
>>
>>Is this the only thing that PF0 is special with? Perhaps you can
>>utilize this for other features as well, since your fw design is like
>>this.
>>
>
>None api user is allowed to the unordered driver bind when there are muxed
>pins involved.
>This requires a fix in the api not in the driver.
>
>>
>>>- not needed,
>>>- prone for errors.
>>>
>>>The dpll subsystem is here to make driver life easier.
>>
>>No, the subsystem is never here to handle device specific issues. And
>>your PF0 dependency is very clearly something device specific. Don't
>>pollute the dpll subsystem with workaround to handle specific device
>>needs. create/register the dplls objects from your driver only when it
>>is valid to do so. Make sure the lifetime of such object stays in
>>the scope of validity. Handle that in the driver. Very clear and simple.
>>
>
>In our case this is PF0 but this is broader issue. The muxed pins
>infrastructure in now slightly broken and I am fixing it.
>From the beginning it was designed to allow separated driver instances
>to create a device and connect their pins with it.

That is true. That's exacly what we have implemented in mlx5. Each
instance registers dpll device and the pin related to the instance. No
problem.

The fact that you do register only in PF0 is a limitation in your
driver. Fix it there.


>Any driver which would use it would face this issue. What you are trying
>to imply is that it is better to put traffic lights on each car instead
>of putting them on crossroads.
>
>>Thanks!
>>
>
>Thank you!
>Arkadiusz
>
>>
>>>
>>>Thank you!
>>>Arkadiusz
>>>
>>>>
>>>>>
>>>>>Thank you!
>>>>>Arkadiusz
>>>>>
>>>>>>
>>>>>>>
>>>>>>>Thank you!
>>>>>>>Arkadiusz
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>Thank you!
>>>>>>>>>Arkadiusz
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 	refcount_set(&pin->refcount, 1);
>>>>>>>>>>> 	xa_init_flags(&pin->dpll_refs, XA_FLAGS_ALLOC);
>>>>>>>>>>> 	xa_init_flags(&pin->parent_refs, XA_FLAGS_ALLOC);
>>>>>>>>>>>@@ -634,7 +634,7 @@ int dpll_pin_on_pin_register(struct dpll_pin
>>>>>>>>>>>*parent,
>>>>>>>>>>>struct dpll_pin *pin,
>>>>>>>>>>> 	unsigned long i, stop;
>>>>>>>>>>> 	int ret;
>>>>>>>>>>>
>>>>>>>>>>>-	if (WARN_ON(parent->prop->type != DPLL_PIN_TYPE_MUX))
>>>>>>>>>>>+	if (WARN_ON(parent->prop.type != DPLL_PIN_TYPE_MUX))
>>>>>>>>>>> 		return -EINVAL;
>>>>>>>>>>>
>>>>>>>>>>> 	if (WARN_ON(!ops) ||
>>>>>>>>>>>diff --git a/drivers/dpll/dpll_core.h b/drivers/dpll/dpll_core.h
>>>>>>>>>>>index 5585873c5c1b..717f715015c7 100644
>>>>>>>>>>>--- a/drivers/dpll/dpll_core.h
>>>>>>>>>>>+++ b/drivers/dpll/dpll_core.h
>>>>>>>>>>>@@ -44,7 +44,7 @@ struct dpll_device {
>>>>>>>>>>>  * @module:		module of creator
>>>>>>>>>>>  * @dpll_refs:		hold referencees to dplls pin was
>>>>>>>>>>>registered
>>>>>>>>>>>with
>>>>>>>>>>>  * @parent_refs:	hold references to parent pins pin was
>>>>>>>>>>>registered
>>>>>>>>>>>with
>>>>>>>>>>>- * @prop:		pointer to pin properties given by registerer
>>>>>>>>>>>+ * @prop:		pin properties copied from the registerer
>>>>>>>>>>>  * @rclk_dev_name:	holds name of device when pin can recover
>>>>>>>>>>>clock
>>>>>>>>>>>from it
>>>>>>>>>>>  * @refcount:		refcount
>>>>>>>>>>>  **/
>>>>>>>>>>>@@ -55,7 +55,7 @@ struct dpll_pin {
>>>>>>>>>>> 	struct module *module;
>>>>>>>>>>> 	struct xarray dpll_refs;
>>>>>>>>>>> 	struct xarray parent_refs;
>>>>>>>>>>>-	const struct dpll_pin_properties *prop;
>>>>>>>>>>>+	struct dpll_pin_properties prop;
>>>>>>>>>>> 	refcount_t refcount;
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>>diff --git a/drivers/dpll/dpll_netlink.c
>>>>>>>>>>>b/drivers/dpll/dpll_netlink.c
>>>>>>>>>>>index 93fc6c4b8a78..963bbbbe6660 100644
>>>>>>>>>>>--- a/drivers/dpll/dpll_netlink.c
>>>>>>>>>>>+++ b/drivers/dpll/dpll_netlink.c
>>>>>>>>>>>@@ -278,17 +278,17 @@ dpll_msg_add_pin_freq(struct sk_buff *msg,
>>>>>>>>>>>struct
>>>>>>>>>>>dpll_pin *pin,
>>>>>>>>>>> 	if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY, sizeof(freq),
>>>>>>>>>>>&freq,
>>>>>>>>>>> 			  DPLL_A_PIN_PAD))
>>>>>>>>>>> 		return -EMSGSIZE;
>>>>>>>>>>>-	for (fs = 0; fs < pin->prop->freq_supported_num; fs++) {
>>>>>>>>>>>+	for (fs = 0; fs < pin->prop.freq_supported_num; fs++) {
>>>>>>>>>>> 		nest = nla_nest_start(msg,
>>>>>>>>>>>DPLL_A_PIN_FREQUENCY_SUPPORTED);
>>>>>>>>>>> 		if (!nest)
>>>>>>>>>>> 			return -EMSGSIZE;
>>>>>>>>>>>-		freq = pin->prop->freq_supported[fs].min;
>>>>>>>>>>>+		freq = pin->prop.freq_supported[fs].min;
>>>>>>>>>>> 		if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY_MIN,
>>>>>>>>>>>sizeof(freq),
>>>>>>>>>>> 				  &freq, DPLL_A_PIN_PAD)) {
>>>>>>>>>>> 			nla_nest_cancel(msg, nest);
>>>>>>>>>>> 			return -EMSGSIZE;
>>>>>>>>>>> 		}
>>>>>>>>>>>-		freq = pin->prop->freq_supported[fs].max;
>>>>>>>>>>>+		freq = pin->prop.freq_supported[fs].max;
>>>>>>>>>>> 		if (nla_put_64bit(msg, DPLL_A_PIN_FREQUENCY_MAX,
>>>>>>>>>>>sizeof(freq),
>>>>>>>>>>> 				  &freq, DPLL_A_PIN_PAD)) {
>>>>>>>>>>> 			nla_nest_cancel(msg, nest);
>>>>>>>>>>>@@ -304,9 +304,9 @@ static bool dpll_pin_is_freq_supported(struct
>>>>>>>>>>>dpll_pin
>>>>>>>>>>>*pin, u32 freq)
>>>>>>>>>>> {
>>>>>>>>>>> 	int fs;
>>>>>>>>>>>
>>>>>>>>>>>-	for (fs = 0; fs < pin->prop->freq_supported_num; fs++)
>>>>>>>>>>>-		if (freq >= pin->prop->freq_supported[fs].min &&
>>>>>>>>>>>-		    freq <= pin->prop->freq_supported[fs].max)
>>>>>>>>>>>+	for (fs = 0; fs < pin->prop.freq_supported_num; fs++)
>>>>>>>>>>>+		if (freq >= pin->prop.freq_supported[fs].min &&
>>>>>>>>>>>+		    freq <= pin->prop.freq_supported[fs].max)
>>>>>>>>>>> 			return true;
>>>>>>>>>>> 	return false;
>>>>>>>>>>> }
>>>>>>>>>>>@@ -403,7 +403,7 @@ static int
>>>>>>>>>>> dpll_cmd_pin_get_one(struct sk_buff *msg, struct dpll_pin *pin,
>>>>>>>>>>> 		     struct netlink_ext_ack *extack)
>>>>>>>>>>> {
>>>>>>>>>>>-	const struct dpll_pin_properties *prop = pin->prop;
>>>>>>>>>>>+	const struct dpll_pin_properties *prop = &pin->prop;
>>>>>>>>>>> 	struct dpll_pin_ref *ref;
>>>>>>>>>>> 	int ret;
>>>>>>>>>>>
>>>>>>>>>>>@@ -696,7 +696,7 @@ dpll_pin_on_pin_state_set(struct dpll_pin
>>>>>>>>>>>*pin,
>>>>>>>>>>>u32
>>>>>>>>>>>parent_idx,
>>>>>>>>>>> 	int ret;
>>>>>>>>>>>
>>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE &
>>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "state changing is not allowed");
>>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>>> 	}
>>>>>>>>>>>@@ -732,7 +732,7 @@ dpll_pin_state_set(struct dpll_device *dpll,
>>>>>>>>>>>struct
>>>>>>>>>>>dpll_pin *pin,
>>>>>>>>>>> 	int ret;
>>>>>>>>>>>
>>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_STATE_CAN_CHANGE &
>>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "state changing is not allowed");
>>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>>> 	}
>>>>>>>>>>>@@ -759,7 +759,7 @@ dpll_pin_prio_set(struct dpll_device *dpll,
>>>>>>>>>>>struct
>>>>>>>>>>>dpll_pin *pin,
>>>>>>>>>>> 	int ret;
>>>>>>>>>>>
>>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_PRIORITY_CAN_CHANGE &
>>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "prio changing is not allowed");
>>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>>> 	}
>>>>>>>>>>>@@ -787,7 +787,7 @@ dpll_pin_direction_set(struct dpll_pin *pin,
>>>>>>>>>>>struct
>>>>>>>>>>>dpll_device *dpll,
>>>>>>>>>>> 	int ret;
>>>>>>>>>>>
>>>>>>>>>>> 	if (!(DPLL_PIN_CAPABILITIES_DIRECTION_CAN_CHANGE &
>>>>>>>>>>>-	      pin->prop->capabilities)) {
>>>>>>>>>>>+	      pin->prop.capabilities)) {
>>>>>>>>>>> 		NL_SET_ERR_MSG(extack, "direction changing is not
>>>>>>>>>>>allowed");
>>>>>>>>>>> 		return -EOPNOTSUPP;
>>>>>>>>>>> 	}
>>>>>>>>>>>@@ -817,8 +817,8 @@ dpll_pin_phase_adj_set(struct dpll_pin *pin,
>>>>>>>>>>>struct
>>>>>>>>>>>nlattr *phase_adj_attr,
>>>>>>>>>>> 	int ret;
>>>>>>>>>>>
>>>>>>>>>>> 	phase_adj = nla_get_s32(phase_adj_attr);
>>>>>>>>>>>-	if (phase_adj > pin->prop->phase_range.max ||
>>>>>>>>>>>-	    phase_adj < pin->prop->phase_range.min) {
>>>>>>>>>>>+	if (phase_adj > pin->prop.phase_range.max ||
>>>>>>>>>>>+	    phase_adj < pin->prop.phase_range.min) {
>>>>>>>>>>> 		NL_SET_ERR_MSG_ATTR(extack, phase_adj_attr,
>>>>>>>>>>> 				    "phase adjust value not supported");
>>>>>>>>>>> 		return -EINVAL;
>>>>>>>>>>>@@ -999,7 +999,7 @@ dpll_pin_find(u64 clock_id, struct nlattr
>>>>>>>>>>>*mod_name_attr,
>>>>>>>>>>> 	unsigned long i;
>>>>>>>>>>>
>>>>>>>>>>> 	xa_for_each_marked(&dpll_pin_xa, i, pin, DPLL_REGISTERED) {
>>>>>>>>>>>-		prop = pin->prop;
>>>>>>>>>>>+		prop = &pin->prop;
>>>>>>>>>>> 		cid_match = clock_id ? pin->clock_id == clock_id : true;
>>>>>>>>>>> 		mod_match = mod_name_attr && module_name(pin->module) ?
>>>>>>>>>>> 			!nla_strcmp(mod_name_attr,
>>>>>>>>>>>--
>>>>>>>>>>>2.38.1
>>>>>>>>>>>
>>>>>>>>>
>>>>>

^ permalink raw reply

* [RFC ipsec-next 2/8] iptfs: uapi: ip: add ip_tfs_*_hdr packet formats
From: Christian Hopps @ 2023-11-10 11:37 UTC (permalink / raw)
  To: devel; +Cc: Steffen Klassert, netdev, Christian Hopps
In-Reply-To: <20231110113719.3055788-1-chopps@chopps.org>

From: Christian Hopps <chopps@labn.net>

Add the on-wire basic and congestion-control IP-TFS packet headers.

Signed-off-by: Christian Hopps <chopps@labn.net>
---
 include/uapi/linux/ip.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/uapi/linux/ip.h b/include/uapi/linux/ip.h
index 283dec7e3645..cc83878ecf08 100644
--- a/include/uapi/linux/ip.h
+++ b/include/uapi/linux/ip.h
@@ -137,6 +137,23 @@ struct ip_beet_phdr {
 	__u8 reserved;
 };
 
+struct ip_iptfs_hdr {
+	__u8 subtype;		/* 0*: basic, 1: CC */
+	__u8 flags;
+	__be16 block_offset;
+};
+
+struct ip_iptfs_cc_hdr {
+	__u8 subtype;		/* 0: basic, 1*: CC */
+	__u8 flags;
+	__be16 block_offset;
+	__be32 loss_rate;
+	__u8 rtt_and_adelay1[4];
+	__u8 adelay2_and_xdelay[4];
+	__be32 tval;
+	__be32 techo;
+};
+
 /* index values for the variables in ipv4_devconf */
 enum
 {
-- 
2.42.0


^ permalink raw reply related

* [RFC ipsec-next 3/8] iptfs: uapi: IPPROTO_AGGFRAG AGGFRAG in ESP
From: Christian Hopps @ 2023-11-10 11:37 UTC (permalink / raw)
  To: devel; +Cc: Steffen Klassert, netdev, Christian Hopps
In-Reply-To: <20231110113719.3055788-1-chopps@chopps.org>

From: Christian Hopps <chopps@labn.net>

Add the RFC assigned IP protocol number for AGGFRAG.

Signed-off-by: Christian Hopps <chopps@labn.net>
---
 include/uapi/linux/in.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
index e682ab628dfa..e6a1f3e4c58c 100644
--- a/include/uapi/linux/in.h
+++ b/include/uapi/linux/in.h
@@ -79,6 +79,8 @@ enum {
 #define IPPROTO_MPLS		IPPROTO_MPLS
   IPPROTO_ETHERNET = 143,	/* Ethernet-within-IPv6 Encapsulation	*/
 #define IPPROTO_ETHERNET	IPPROTO_ETHERNET
+  IPPROTO_AGGFRAG = 144,	/* AGGFRAG in ESP (RFC 9347)		*/
+#define IPPROTO_AGGFRAG		IPPROTO_AGGFRAG
   IPPROTO_RAW = 255,		/* Raw IP packets			*/
 #define IPPROTO_RAW		IPPROTO_RAW
   IPPROTO_MPTCP = 262,		/* Multipath TCP connection		*/
-- 
2.42.0


^ permalink raw reply related

* [RFC ipsec-next] Add IP-TFS mode to xfrm
From: Christian Hopps @ 2023-11-10 11:37 UTC (permalink / raw)
  To: devel; +Cc: Steffen Klassert, netdev, Christian Hopps

From: Christian Hopps <chopps@labn.net>

This patchset adds a new xfrm mode implementing on-demand IP-TFS. IP-TFS
(AggFrag encapsulation) has been standardized in RFC9347.

Link: https://www.rfc-editor.org/rfc/rfc9347.txt

In order to allow loading this fucntionality as a module a set of callbacks
xfrm_mode_cbs has been added to xfrm as well.

^ permalink raw reply

* [RFC ipsec-next 5/8] iptfs: netlink: add config (netlink) options
From: Christian Hopps @ 2023-11-10 11:37 UTC (permalink / raw)
  To: devel; +Cc: Steffen Klassert, netdev, Christian Hopps
In-Reply-To: <20231110113719.3055788-1-chopps@chopps.org>

From: Christian Hopps <chopps@labn.net>

Add netlink options for configuring IP-TFS SAs.

Signed-off-by: Christian Hopps <chopps@labn.net>
---
 include/uapi/linux/xfrm.h |  6 ++++++
 net/xfrm/xfrm_user.c      | 16 ++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index 6a77328be114..fa6d264f2ad1 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -315,6 +315,12 @@ enum xfrm_attr_type_t {
 	XFRMA_SET_MARK_MASK,	/* __u32 */
 	XFRMA_IF_ID,		/* __u32 */
 	XFRMA_MTIMER_THRESH,	/* __u32 in seconds for input SA */
+	XFRMA_IPTFS_PKT_SIZE,	/* __u32 Size of outer packet, 0 for PMTU */
+	XFRMA_IPTFS_MAX_QSIZE,	/* __u32 max ingress queue size */
+	XFRMA_IPTFS_DONT_FRAG,	/* don't use fragmentation */
+	XFRMA_IPTFS_DROP_TIME,	/* __u32 usec to wait for next seq */
+	XFRMA_IPTFS_REORD_WIN,	/* __u16 reorder window size */
+	XFRMA_IPTFS_IN_DELAY,	/* __u32 initial packet wait delay (usec) */
 	__XFRMA_MAX
 
 #define XFRMA_OUTPUT_MARK XFRMA_SET_MARK	/* Compatibility */
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index ad01997c3aa9..ed95772bbd3f 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -272,6 +272,16 @@ static int verify_newsa_info(struct xfrm_usersa_info *p,
 			NL_SET_ERR_MSG(extack, "TFC padding can only be used in tunnel mode");
 			goto out;
 		}
+		if ((attrs[XFRMA_IPTFS_PKT_SIZE] ||
+		     attrs[XFRMA_IPTFS_MAX_QSIZE] ||
+		     attrs[XFRMA_IPTFS_DONT_FRAG] ||
+		     attrs[XFRMA_IPTFS_DROP_TIME] ||
+		     attrs[XFRMA_IPTFS_REORD_WIN] ||
+		     attrs[XFRMA_IPTFS_IN_DELAY]) &&
+		    p->mode != XFRM_MODE_IPTFS) {
+			NL_SET_ERR_MSG(extack, "IPTFS options can only be used in IPTFS mode");
+			goto out;
+		}
 		break;
 
 	case IPPROTO_COMP:
@@ -3046,6 +3056,12 @@ const struct nla_policy xfrma_policy[XFRMA_MAX+1] = {
 	[XFRMA_SET_MARK_MASK]	= { .type = NLA_U32 },
 	[XFRMA_IF_ID]		= { .type = NLA_U32 },
 	[XFRMA_MTIMER_THRESH]   = { .type = NLA_U32 },
+	[XFRMA_IPTFS_PKT_SIZE]	= { .type = NLA_U32 },
+	[XFRMA_IPTFS_MAX_QSIZE]	= { .type = NLA_U32 },
+	[XFRMA_IPTFS_DONT_FRAG]	= { .type = NLA_FLAG },
+	[XFRMA_IPTFS_DROP_TIME]	= { .type = NLA_U32 },
+	[XFRMA_IPTFS_REORD_WIN]	= { .type = NLA_U16 },
+	[XFRMA_IPTFS_IN_DELAY]	= { .type = NLA_U32 },
 };
 EXPORT_SYMBOL_GPL(xfrma_policy);
 
-- 
2.42.0


^ permalink raw reply related

* [RFC ipsec-next 4/8] iptfs: sysctl: allow configuration of global default values
From: Christian Hopps @ 2023-11-10 11:37 UTC (permalink / raw)
  To: devel; +Cc: Steffen Klassert, netdev, Christian Hopps
In-Reply-To: <20231110113719.3055788-1-chopps@chopps.org>

From: Christian Hopps <chopps@labn.net>

Add sysctls for the changing the IPTFS default SA values.

Signed-off-by: Christian Hopps <chopps@labn.net>
---
 Documentation/networking/xfrm_sysctl.rst | 29 ++++++++++++++++++
 include/net/netns/xfrm.h                 |  6 ++++
 include/net/xfrm.h                       |  7 +++++
 net/xfrm/xfrm_sysctl.c                   | 38 ++++++++++++++++++++++++
 4 files changed, 80 insertions(+)

diff --git a/Documentation/networking/xfrm_sysctl.rst b/Documentation/networking/xfrm_sysctl.rst
index 47b9bbdd0179..365220e4a072 100644
--- a/Documentation/networking/xfrm_sysctl.rst
+++ b/Documentation/networking/xfrm_sysctl.rst
@@ -9,3 +9,32 @@ XFRM Syscall
 
 xfrm_acq_expires - INTEGER
 	default 30 - hard timeout in seconds for acquire requests
+
+xfrm_iptfs_maxqsize - UNSIGNED INTEGER
+        The default IPTFS max output queue size. The output queue is where
+        received packets destined for output over an IPTFS tunnel are stored
+        prior to being output in aggregated/fragmented form over the IPTFS
+        tunnel.
+
+        Default 1M.
+
+xfrm_iptfs_drptime - UNSIGNED INTEGER
+        The default IPTFS drop time. The drop time is the amount of time before
+        a missing out-of-order IPTFS tunnel packet is considered lost. See also
+        the reorder window.
+
+        Default 1s (1000000).
+
+xfrm_iptfs_idelay - UNSIGNED INTEGER
+        The default IPTFS initial output delay. The initial output delay is the
+        amount of time prior to servicing the output queue after queueing the
+        first packet on said queue.
+
+        Default 0.
+
+xfrm_iptfs_rewin - UNSIGNED INTEGER
+        The default IPTFS reorder window size. The reorder window size dictates
+        the maximum number of IPTFS tunnel packets in a sequence that may arrive
+        out of order.
+
+        Default 3.
diff --git a/include/net/netns/xfrm.h b/include/net/netns/xfrm.h
index bd7c3be4af5d..d5ad2155d0bb 100644
--- a/include/net/netns/xfrm.h
+++ b/include/net/netns/xfrm.h
@@ -65,6 +65,12 @@ struct netns_xfrm {
 	u32			sysctl_aevent_rseqth;
 	int			sysctl_larval_drop;
 	u32			sysctl_acq_expires;
+#if IS_ENABLED(CONFIG_XFRM_IPTFS)
+	u32			sysctl_iptfs_drptime;
+	u32			sysctl_iptfs_idelay;
+	u32			sysctl_iptfs_maxqsize;
+	u32			sysctl_iptfs_rewin;
+#endif
 
 	u8			policy_default[XFRM_POLICY_MAX];
 
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index c9bb0f892f55..d2e87344d175 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -2190,4 +2190,11 @@ static inline int register_xfrm_interface_bpf(void)
 
 #endif
 
+#if IS_ENABLED(CONFIG_XFRM_IPTFS)
+#define XFRM_IPTFS_DEFAULT_MAX_QUEUE_SIZE (1024 * 1024)
+#define XFRM_IPTFS_DEFAULT_INIT_DELAY_USECS (0)
+#define XFRM_IPTFS_DEFAULT_DROP_TIME_USECS (1000000)
+#define XFRM_IPTFS_DEFAULT_REORDER_WINDOW (3)
+#endif
+
 #endif	/* _NET_XFRM_H */
diff --git a/net/xfrm/xfrm_sysctl.c b/net/xfrm/xfrm_sysctl.c
index 7fdeafc838a7..bf8e73a6c38e 100644
--- a/net/xfrm/xfrm_sysctl.c
+++ b/net/xfrm/xfrm_sysctl.c
@@ -10,6 +10,12 @@ static void __net_init __xfrm_sysctl_init(struct net *net)
 	net->xfrm.sysctl_aevent_rseqth = XFRM_AE_SEQT_SIZE;
 	net->xfrm.sysctl_larval_drop = 1;
 	net->xfrm.sysctl_acq_expires = 30;
+#if IS_ENABLED(CONFIG_XFRM_IPTFS)
+	net->xfrm.sysctl_iptfs_maxqsize = XFRM_IPTFS_DEFAULT_MAX_QUEUE_SIZE;
+	net->xfrm.sysctl_iptfs_drptime = XFRM_IPTFS_DEFAULT_DROP_TIME_USECS;
+	net->xfrm.sysctl_iptfs_idelay = XFRM_IPTFS_DEFAULT_INIT_DELAY_USECS;
+	net->xfrm.sysctl_iptfs_rewin = XFRM_IPTFS_DEFAULT_REORDER_WINDOW;
+#endif
 }
 
 #ifdef CONFIG_SYSCTL
@@ -38,6 +44,32 @@ static struct ctl_table xfrm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+#if IS_ENABLED(CONFIG_XFRM_IPTFS)
+	{
+		.procname	= "xfrm_iptfs_drptime",
+		.maxlen		= sizeof(uint),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec
+	},
+	{
+		.procname	= "xfrm_iptfs_idelay",
+		.maxlen		= sizeof(uint),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec
+	},
+	{
+		.procname	= "xfrm_iptfs_maxqsize",
+		.maxlen		= sizeof(uint),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec
+	},
+	{
+		.procname	= "xfrm_iptfs_rewin",
+		.maxlen		= sizeof(uint),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec
+	},
+#endif
 	{}
 };
 
@@ -55,6 +87,12 @@ int __net_init xfrm_sysctl_init(struct net *net)
 	table[1].data = &net->xfrm.sysctl_aevent_rseqth;
 	table[2].data = &net->xfrm.sysctl_larval_drop;
 	table[3].data = &net->xfrm.sysctl_acq_expires;
+#if IS_ENABLED(CONFIG_XFRM_IPTFS)
+	table[4].data = &net->xfrm.sysctl_iptfs_drptime;
+	table[5].data = &net->xfrm.sysctl_iptfs_idelay;
+	table[6].data = &net->xfrm.sysctl_iptfs_maxqsize;
+	table[7].data = &net->xfrm.sysctl_iptfs_rewin;
+#endif
 
 	/* Don't export sysctls to unprivileged users */
 	if (net->user_ns != &init_user_ns) {
-- 
2.42.0


^ permalink raw reply related

* [RFC ipsec-next 6/8] iptfs: xfrm: Add mode_cbs module functionality
From: Christian Hopps @ 2023-11-10 11:37 UTC (permalink / raw)
  To: devel; +Cc: Steffen Klassert, netdev, Christian Hopps
In-Reply-To: <20231110113719.3055788-1-chopps@chopps.org>

From: Christian Hopps <chopps@labn.net>

Add a set of callbacks xfrm_mode_cbs to xfrm_state. These callbacks
enable the addition of new xfrm modes, such as IP-TFS to be defined
in modules.

Signed-off-by: Christian Hopps <chopps@labn.net>
---
 include/net/xfrm.h     | 36 ++++++++++++++++++++++++++++++++++++
 net/xfrm/xfrm_device.c |  3 ++-
 net/xfrm/xfrm_input.c  | 14 ++++++++++++--
 net/xfrm/xfrm_output.c |  9 +++++++--
 net/xfrm/xfrm_policy.c | 18 +++++++++++-------
 net/xfrm/xfrm_state.c  | 41 +++++++++++++++++++++++++++++++++++++++++
 net/xfrm/xfrm_user.c   | 10 ++++++++++
 7 files changed, 119 insertions(+), 12 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index d2e87344d175..aeeadadc9545 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -204,6 +204,7 @@ struct xfrm_state {
 		u16		family;
 		xfrm_address_t	saddr;
 		int		header_len;
+		int		enc_hdr_len;
 		int		trailer_len;
 		u32		extra_flags;
 		struct xfrm_mark	smark;
@@ -289,6 +290,9 @@ struct xfrm_state {
 	/* Private data of this transformer, format is opaque,
 	 * interpreted by xfrm_type methods. */
 	void			*data;
+
+	const struct xfrm_mode_cbs	*mode_cbs;
+	void				*mode_data;
 };
 
 static inline struct net *xs_net(struct xfrm_state *x)
@@ -441,6 +445,38 @@ struct xfrm_type_offload {
 int xfrm_register_type_offload(const struct xfrm_type_offload *type, unsigned short family);
 void xfrm_unregister_type_offload(const struct xfrm_type_offload *type, unsigned short family);
 
+struct xfrm_mode_cbs {
+	struct module	*owner;
+	/* Add/delete state in the new xfrm_state in `x`. */
+	int	(*create_state)(struct xfrm_state *x);
+	void	(*delete_state)(struct xfrm_state *x);
+
+	/* Called while handling the user netlink options. */
+	int	(*user_init)(struct net *net, struct xfrm_state *x,
+			     struct nlattr **attrs);
+	int	(*copy_to_user)(struct xfrm_state *x, struct sk_buff *skb);
+
+	u32	(*get_inner_mtu)(struct xfrm_state *x, int outer_mtu);
+
+	/* Called to handle received xfrm (egress) packets. */
+	int	(*input)(struct xfrm_state *x, struct sk_buff *skb);
+
+	/* Placed in dst_output of the dst when an xfrm_state is bound. */
+	int	(*output)(struct net *net, struct sock *sk, struct sk_buff *skb);
+
+	/**
+	 * Prepare the skb for output for the given mode. Returns:
+	 *    Error value, if 0 then skb values should be as follows:
+	 *    transport_header should point at ESP header
+	 *    network_header should point at Outer IP header
+	 *    mac_header should point at protocol/nexthdr of the outer IP
+	 */
+	int	(*prepare_output)(struct xfrm_state *x, struct sk_buff *skb);
+};
+
+int xfrm_register_mode_cbs(u8 mode, const struct xfrm_mode_cbs *mode_cbs);
+void xfrm_unregister_mode_cbs(u8 mode);
+
 static inline int xfrm_af2proto(unsigned int family)
 {
 	switch(family) {
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 3784534c9185..8b848540ea47 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -42,7 +42,8 @@ static void __xfrm_mode_tunnel_prep(struct xfrm_state *x, struct sk_buff *skb,
 		skb->transport_header = skb->network_header + hsize;
 
 	skb_reset_mac_len(skb);
-	pskb_pull(skb, skb->mac_len + x->props.header_len);
+	pskb_pull(skb,
+		  skb->mac_len + x->props.header_len - x->props.enc_hdr_len);
 }
 
 static void __xfrm_mode_beet_prep(struct xfrm_state *x, struct sk_buff *skb,
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index bd4ce21d76d7..824f7b7f90e0 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -437,6 +437,9 @@ static int xfrm_inner_mode_input(struct xfrm_state *x,
 		WARN_ON_ONCE(1);
 		break;
 	default:
+		if (x->mode_cbs && x->mode_cbs->input)
+			return x->mode_cbs->input(x, skb);
+
 		WARN_ON_ONCE(1);
 		break;
 	}
@@ -479,6 +482,10 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 
 		family = x->props.family;
 
+		/* An encap_type of -3 indicates reconstructed inner packet */
+		if (encap_type == -3)
+			goto resume_decapped;
+
 		/* An encap_type of -1 indicates async resumption. */
 		if (encap_type == -1) {
 			async = 1;
@@ -660,11 +667,14 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type)
 
 		XFRM_MODE_SKB_CB(skb)->protocol = nexthdr;
 
-		if (xfrm_inner_mode_input(x, skb)) {
+		err = xfrm_inner_mode_input(x, skb);
+		if (err == -EINPROGRESS)
+			return 0;
+		else if (err) {
 			XFRM_INC_STATS(net, LINUX_MIB_XFRMINSTATEMODEERROR);
 			goto drop;
 		}
-
+resume_decapped:
 		if (x->outer_mode.flags & XFRM_MODE_FLAG_TUNNEL) {
 			decaps = 1;
 			break;
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 662c83beb345..4390c111410d 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -280,7 +280,9 @@ static int xfrm4_tunnel_encap_add(struct xfrm_state *x, struct sk_buff *skb)
 	skb_set_inner_network_header(skb, skb_network_offset(skb));
 	skb_set_inner_transport_header(skb, skb_transport_offset(skb));
 
-	skb_set_network_header(skb, -x->props.header_len);
+	/* backup to add space for the outer encap */
+	skb_set_network_header(skb,
+			       -x->props.header_len + x->props.enc_hdr_len);
 	skb->mac_header = skb->network_header +
 			  offsetof(struct iphdr, protocol);
 	skb->transport_header = skb->network_header + sizeof(*top_iph);
@@ -325,7 +327,8 @@ static int xfrm6_tunnel_encap_add(struct xfrm_state *x, struct sk_buff *skb)
 	skb_set_inner_network_header(skb, skb_network_offset(skb));
 	skb_set_inner_transport_header(skb, skb_transport_offset(skb));
 
-	skb_set_network_header(skb, -x->props.header_len);
+	skb_set_network_header(skb,
+			       -x->props.header_len + x->props.enc_hdr_len);
 	skb->mac_header = skb->network_header +
 			  offsetof(struct ipv6hdr, nexthdr);
 	skb->transport_header = skb->network_header + sizeof(*top_iph);
@@ -472,6 +475,8 @@ static int xfrm_outer_mode_output(struct xfrm_state *x, struct sk_buff *skb)
 		WARN_ON_ONCE(1);
 		break;
 	default:
+		if (x->mode_cbs->prepare_output)
+			return x->mode_cbs->prepare_output(x, skb);
 		WARN_ON_ONCE(1);
 		break;
 	}
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index d2dddc570f4f..3220b01121f3 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2707,13 +2707,17 @@ static struct dst_entry *xfrm_bundle_create(struct xfrm_policy *policy,
 
 		dst1->input = dst_discard;
 
-		rcu_read_lock();
-		afinfo = xfrm_state_afinfo_get_rcu(inner_mode->family);
-		if (likely(afinfo))
-			dst1->output = afinfo->output;
-		else
-			dst1->output = dst_discard_out;
-		rcu_read_unlock();
+		if (xfrm[i]->mode_cbs && xfrm[i]->mode_cbs->output) {
+			dst1->output = xfrm[i]->mode_cbs->output;
+		} else {
+			rcu_read_lock();
+			afinfo = xfrm_state_afinfo_get_rcu(inner_mode->family);
+			if (likely(afinfo))
+				dst1->output = afinfo->output;
+			else
+				dst1->output = dst_discard_out;
+			rcu_read_unlock();
+		}
 
 		xdst_prev = xdst;
 
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index bda5327bf34d..f5e1a17ebf74 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -513,6 +513,36 @@ static const struct xfrm_mode *xfrm_get_mode(unsigned int encap, int family)
 	return NULL;
 }
 
+static struct xfrm_mode_cbs xfrm_mode_cbs_map[XFRM_MODE_MAX];
+
+int xfrm_register_mode_cbs(u8 mode, const struct xfrm_mode_cbs *mode_cbs)
+{
+	if (mode >= XFRM_MODE_MAX)
+		return -EINVAL;
+
+	xfrm_mode_cbs_map[mode] = *mode_cbs;
+	return 0;
+}
+EXPORT_SYMBOL(xfrm_register_mode_cbs);
+
+void xfrm_unregister_mode_cbs(u8 mode)
+{
+	if (mode >= XFRM_MODE_MAX)
+		return;
+
+	memset(&xfrm_mode_cbs_map[mode], 0, sizeof(xfrm_mode_cbs_map[mode]));
+}
+EXPORT_SYMBOL(xfrm_unregister_mode_cbs);
+
+static const struct xfrm_mode_cbs *xfrm_get_mode_cbs(u8 mode)
+{
+	if (mode >= XFRM_MODE_MAX)
+		return NULL;
+	if (mode == XFRM_MODE_IPTFS && !xfrm_mode_cbs_map[mode].create_state)
+		request_module("xfrm-iptfs");
+	return &xfrm_mode_cbs_map[mode];
+}
+
 void xfrm_state_free(struct xfrm_state *x)
 {
 	kmem_cache_free(xfrm_state_cache, x);
@@ -521,6 +551,8 @@ EXPORT_SYMBOL(xfrm_state_free);
 
 static void ___xfrm_state_destroy(struct xfrm_state *x)
 {
+	if (x->mode_cbs && x->mode_cbs->delete_state)
+		x->mode_cbs->delete_state(x);
 	hrtimer_cancel(&x->mtimer);
 	del_timer_sync(&x->rtimer);
 	kfree(x->aead);
@@ -2765,6 +2797,9 @@ u32 xfrm_state_mtu(struct xfrm_state *x, int mtu)
 	case XFRM_MODE_TUNNEL:
 		break;
 	default:
+		if (x->mode_cbs && x->mode_cbs->get_inner_mtu)
+			return x->mode_cbs->get_inner_mtu(x, mtu);
+
 		WARN_ON_ONCE(1);
 		break;
 	}
@@ -2850,6 +2885,12 @@ int __xfrm_init_state(struct xfrm_state *x, bool init_replay, bool offload,
 			goto error;
 	}
 
+	x->mode_cbs = xfrm_get_mode_cbs(x->props.mode);
+	if (x->mode_cbs && x->mode_cbs->create_state) {
+		err = x->mode_cbs->create_state(x);
+		if (err)
+			goto error;
+	}
 error:
 	return err;
 }
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index ed95772bbd3f..8a504331e369 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -779,6 +779,12 @@ static struct xfrm_state *xfrm_state_construct(struct net *net,
 			goto error;
 	}
 
+	if (x->mode_cbs && x->mode_cbs->user_init) {
+		err = x->mode_cbs->user_init(net, x, attrs);
+		if (err)
+			goto error;
+	}
+
 	return x;
 
 error:
@@ -1192,6 +1198,10 @@ static int copy_to_user_state_extra(struct xfrm_state *x,
 		if (ret)
 			goto out;
 	}
+	if (x->mode_cbs && x->mode_cbs->copy_to_user)
+		ret = x->mode_cbs->copy_to_user(x, skb);
+	if (ret)
+		goto out;
 	if (x->mapping_maxage)
 		ret = nla_put_u32(skb, XFRMA_MTIMER_THRESH, x->mapping_maxage);
 out:
-- 
2.42.0


^ permalink raw reply related

* [RFC ipsec-next 7/8] iptfs: xfrm: add generic iptfs defines and functionality
From: Christian Hopps @ 2023-11-10 11:37 UTC (permalink / raw)
  To: devel; +Cc: Steffen Klassert, netdev, Christian Hopps
In-Reply-To: <20231110113719.3055788-1-chopps@chopps.org>

From: Christian Hopps <chopps@labn.net>

Define `XFRM_MODE_IPTFS` and `IPSEC_MODE_IPTFS` constants, and add these to
switch case and conditionals adjacent with the existing TUNNEL modes.

Signed-off-by: Christian Hopps <chopps@labn.net>
---
 include/net/xfrm.h         |  1 +
 include/uapi/linux/ipsec.h |  3 ++-
 include/uapi/linux/snmp.h  |  2 ++
 include/uapi/linux/xfrm.h  |  3 ++-
 net/ipv4/esp4.c            |  3 ++-
 net/ipv6/esp6.c            |  3 ++-
 net/netfilter/nft_xfrm.c   |  3 ++-
 net/xfrm/xfrm_device.c     |  1 +
 net/xfrm/xfrm_output.c     |  4 ++++
 net/xfrm/xfrm_policy.c     |  8 ++++++--
 net/xfrm/xfrm_proc.c       |  2 ++
 net/xfrm/xfrm_state.c      | 12 ++++++++++++
 net/xfrm/xfrm_user.c       |  3 +++
 13 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index aeeadadc9545..a6e0e848918d 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -37,6 +37,7 @@
 #define XFRM_PROTO_COMP		108
 #define XFRM_PROTO_IPIP		4
 #define XFRM_PROTO_IPV6		41
+#define XFRM_PROTO_IPTFS	IPPROTO_AGGFRAG
 #define XFRM_PROTO_ROUTING	IPPROTO_ROUTING
 #define XFRM_PROTO_DSTOPTS	IPPROTO_DSTOPTS
 
diff --git a/include/uapi/linux/ipsec.h b/include/uapi/linux/ipsec.h
index 50d8ee1791e2..696b790f4346 100644
--- a/include/uapi/linux/ipsec.h
+++ b/include/uapi/linux/ipsec.h
@@ -14,7 +14,8 @@ enum {
 	IPSEC_MODE_ANY		= 0,	/* We do not support this for SA */
 	IPSEC_MODE_TRANSPORT	= 1,
 	IPSEC_MODE_TUNNEL	= 2,
-	IPSEC_MODE_BEET         = 3
+	IPSEC_MODE_BEET         = 3,
+	IPSEC_MODE_IPTFS        = 4
 };
 
 enum {
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 26f33a4c253d..d0b45f4c22c7 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -331,6 +331,8 @@ enum
 	LINUX_MIB_XFRMFWDHDRERROR,		/* XfrmFwdHdrError*/
 	LINUX_MIB_XFRMOUTSTATEINVALID,		/* XfrmOutStateInvalid */
 	LINUX_MIB_XFRMACQUIREERROR,		/* XfrmAcquireError */
+	LINUX_MIB_XFRMINIPTFSERROR,		/* XfrmInIptfsError */
+	LINUX_MIB_XFRMOUTNOQSPACE,		/* XfrmOutNoQueueSpace */
 	__LINUX_MIB_XFRMMAX
 };
 
diff --git a/include/uapi/linux/xfrm.h b/include/uapi/linux/xfrm.h
index fa6d264f2ad1..2e7ffc9b7309 100644
--- a/include/uapi/linux/xfrm.h
+++ b/include/uapi/linux/xfrm.h
@@ -153,7 +153,8 @@ enum {
 #define XFRM_MODE_ROUTEOPTIMIZATION 2
 #define XFRM_MODE_IN_TRIGGER 3
 #define XFRM_MODE_BEET 4
-#define XFRM_MODE_MAX 5
+#define XFRM_MODE_IPTFS 5
+#define XFRM_MODE_MAX 6
 
 /* Netlink configuration messages.  */
 enum {
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 2be2d4922557..b7047c0dd7ea 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -816,7 +816,8 @@ int esp_input_done2(struct sk_buff *skb, int err)
 	}
 
 	skb_pull_rcsum(skb, hlen);
-	if (x->props.mode == XFRM_MODE_TUNNEL)
+	if (x->props.mode == XFRM_MODE_TUNNEL ||
+	    x->props.mode == XFRM_MODE_IPTFS)
 		skb_reset_transport_header(skb);
 	else
 		skb_set_transport_header(skb, -ihl);
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index fddd0cbdede1..10f2190207a8 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -865,7 +865,8 @@ int esp6_input_done2(struct sk_buff *skb, int err)
 	skb_postpull_rcsum(skb, skb_network_header(skb),
 			   skb_network_header_len(skb));
 	skb_pull_rcsum(skb, hlen);
-	if (x->props.mode == XFRM_MODE_TUNNEL)
+	if (x->props.mode == XFRM_MODE_TUNNEL ||
+	    x->props.mode == XFRM_MODE_IPTFS)
 		skb_reset_transport_header(skb);
 	else
 		skb_set_transport_header(skb, -hdr_len);
diff --git a/net/netfilter/nft_xfrm.c b/net/netfilter/nft_xfrm.c
index 452f8587adda..291b029391cd 100644
--- a/net/netfilter/nft_xfrm.c
+++ b/net/netfilter/nft_xfrm.c
@@ -112,7 +112,8 @@ static bool xfrm_state_addr_ok(enum nft_xfrm_keys k, u8 family, u8 mode)
 		return true;
 	}
 
-	return mode == XFRM_MODE_BEET || mode == XFRM_MODE_TUNNEL;
+	return mode == XFRM_MODE_BEET || mode == XFRM_MODE_TUNNEL ||
+	       mode == XFRM_MODE_IPTFS;
 }
 
 static void nft_xfrm_state_get_key(const struct nft_xfrm *priv,
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 8b848540ea47..a40f5e09829e 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -69,6 +69,7 @@ static void __xfrm_mode_beet_prep(struct xfrm_state *x, struct sk_buff *skb,
 static void xfrm_outer_mode_prep(struct xfrm_state *x, struct sk_buff *skb)
 {
 	switch (x->outer_mode.encap) {
+	case XFRM_MODE_IPTFS:
 	case XFRM_MODE_TUNNEL:
 		if (x->outer_mode.family == AF_INET)
 			return __xfrm_mode_tunnel_prep(x, skb,
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index 4390c111410d..16c981ca61ca 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -680,6 +680,10 @@ static void xfrm_get_inner_ipproto(struct sk_buff *skb, struct xfrm_state *x)
 
 		return;
 	}
+	if (x->outer_mode.encap == XFRM_MODE_IPTFS) {
+		xo->inner_ipproto = IPPROTO_AGGFRAG;
+		return;
+	}
 
 	/* non-Tunnel Mode */
 	if (!skb->encapsulation)
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 3220b01121f3..94e5889a77d6 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2468,6 +2468,7 @@ xfrm_tmpl_resolve_one(struct xfrm_policy *policy, const struct flowi *fl,
 		struct xfrm_tmpl *tmpl = &policy->xfrm_vec[i];
 
 		if (tmpl->mode == XFRM_MODE_TUNNEL ||
+		    tmpl->mode == XFRM_MODE_IPTFS ||
 		    tmpl->mode == XFRM_MODE_BEET) {
 			remote = &tmpl->id.daddr;
 			local = &tmpl->saddr;
@@ -3252,7 +3253,8 @@ struct dst_entry *xfrm_lookup_with_ifid(struct net *net,
 ok:
 	xfrm_pols_put(pols, drop_pols);
 	if (dst && dst->xfrm &&
-	    dst->xfrm->props.mode == XFRM_MODE_TUNNEL)
+	    (dst->xfrm->props.mode == XFRM_MODE_TUNNEL ||
+	     dst->xfrm->props.mode == XFRM_MODE_IPTFS))
 		dst->flags |= DST_XFRM_TUNNEL;
 	return dst;
 
@@ -4353,6 +4355,7 @@ static int migrate_tmpl_match(const struct xfrm_migrate *m, const struct xfrm_tm
 		switch (t->mode) {
 		case XFRM_MODE_TUNNEL:
 		case XFRM_MODE_BEET:
+		case XFRM_MODE_IPTFS:
 			if (xfrm_addr_equal(&t->id.daddr, &m->old_daddr,
 					    m->old_family) &&
 			    xfrm_addr_equal(&t->saddr, &m->old_saddr,
@@ -4395,7 +4398,8 @@ static int xfrm_policy_migrate(struct xfrm_policy *pol,
 				continue;
 			n++;
 			if (pol->xfrm_vec[i].mode != XFRM_MODE_TUNNEL &&
-			    pol->xfrm_vec[i].mode != XFRM_MODE_BEET)
+			    pol->xfrm_vec[i].mode != XFRM_MODE_BEET &&
+			    pol->xfrm_vec[i].mode != XFRM_MODE_IPTFS)
 				continue;
 			/* update endpoints */
 			memcpy(&pol->xfrm_vec[i].id.daddr, &mp->new_daddr,
diff --git a/net/xfrm/xfrm_proc.c b/net/xfrm/xfrm_proc.c
index fee9b5cf37a7..d92b1b760749 100644
--- a/net/xfrm/xfrm_proc.c
+++ b/net/xfrm/xfrm_proc.c
@@ -41,6 +41,8 @@ static const struct snmp_mib xfrm_mib_list[] = {
 	SNMP_MIB_ITEM("XfrmFwdHdrError", LINUX_MIB_XFRMFWDHDRERROR),
 	SNMP_MIB_ITEM("XfrmOutStateInvalid", LINUX_MIB_XFRMOUTSTATEINVALID),
 	SNMP_MIB_ITEM("XfrmAcquireError", LINUX_MIB_XFRMACQUIREERROR),
+	SNMP_MIB_ITEM("XfrmInIptfsError", LINUX_MIB_XFRMINIPTFSERROR),
+	SNMP_MIB_ITEM("XfrmOutNoQueueSpace", LINUX_MIB_XFRMOUTNOQSPACE),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index f5e1a17ebf74..786f3fc0d428 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -465,6 +465,11 @@ static const struct xfrm_mode xfrm4_mode_map[XFRM_MODE_MAX] = {
 		.flags = XFRM_MODE_FLAG_TUNNEL,
 		.family = AF_INET,
 	},
+	[XFRM_MODE_IPTFS] = {
+		.encap = XFRM_MODE_IPTFS,
+		.flags = XFRM_MODE_FLAG_TUNNEL,
+		.family = AF_INET,
+	},
 };
 
 static const struct xfrm_mode xfrm6_mode_map[XFRM_MODE_MAX] = {
@@ -486,6 +491,11 @@ static const struct xfrm_mode xfrm6_mode_map[XFRM_MODE_MAX] = {
 		.flags = XFRM_MODE_FLAG_TUNNEL,
 		.family = AF_INET6,
 	},
+	[XFRM_MODE_IPTFS] = {
+		.encap = XFRM_MODE_IPTFS,
+		.flags = XFRM_MODE_FLAG_TUNNEL,
+		.family = AF_INET6,
+	},
 };
 
 static const struct xfrm_mode *xfrm_get_mode(unsigned int encap, int family)
@@ -2083,6 +2093,7 @@ static int __xfrm6_state_sort_cmp(const void *p)
 #endif
 	case XFRM_MODE_TUNNEL:
 	case XFRM_MODE_BEET:
+	case XFRM_MODE_IPTFS:
 		return 4;
 	}
 	return 5;
@@ -2109,6 +2120,7 @@ static int __xfrm6_tmpl_sort_cmp(const void *p)
 #endif
 	case XFRM_MODE_TUNNEL:
 	case XFRM_MODE_BEET:
+	case XFRM_MODE_IPTFS:
 		return 3;
 	}
 	return 4;
diff --git a/net/xfrm/xfrm_user.c b/net/xfrm/xfrm_user.c
index 8a504331e369..389656056326 100644
--- a/net/xfrm/xfrm_user.c
+++ b/net/xfrm/xfrm_user.c
@@ -353,6 +353,7 @@ static int verify_newsa_info(struct xfrm_usersa_info *p,
 	case XFRM_MODE_TUNNEL:
 	case XFRM_MODE_ROUTEOPTIMIZATION:
 	case XFRM_MODE_BEET:
+	case XFRM_MODE_IPTFS:
 		break;
 
 	default:
@@ -1830,6 +1831,8 @@ static int validate_tmpl(int nr, struct xfrm_user_tmpl *ut, u16 family,
 				return -EINVAL;
 			}
 			break;
+		case XFRM_MODE_IPTFS:
+			break;
 		default:
 			if (ut[i].family != prev_family) {
 				NL_SET_ERR_MSG(extack, "Mode in template doesn't support a family change");
-- 
2.42.0


^ permalink raw reply related

* [RFC ipsec-next 8/8] iptfs: impl: add new iptfs xfrm mode impl
From: Christian Hopps @ 2023-11-10 11:37 UTC (permalink / raw)
  To: devel; +Cc: Steffen Klassert, netdev, Christian Hopps
In-Reply-To: <20231110113719.3055788-1-chopps@chopps.org>

From: Christian Hopps <chopps@labn.net>

Add a new xfrm mode implementing AggFrag/IP-TFS from RFC9347.

This utilizes the new xfrm_mode_cbs to implement demand-driven IP-TFS
functionality. This functionality can be used to increase bandwidth
utilization through small packet aggregation, as well as help solve PMTU
issues through it's efficient use of fragmentation.

Link: https://www.rfc-editor.org/rfc/rfc9347.txt

Signed-off-by: Christian Hopps <chopps@labn.net>
---
 include/net/iptfs.h    |   18 +
 net/xfrm/trace_iptfs.h |  226 ++++
 net/xfrm/xfrm_iptfs.c  | 2735 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 2979 insertions(+)
 create mode 100644 include/net/iptfs.h
 create mode 100644 net/xfrm/trace_iptfs.h
 create mode 100644 net/xfrm/xfrm_iptfs.c

diff --git a/include/net/iptfs.h b/include/net/iptfs.h
new file mode 100644
index 000000000000..d8f2e494f251
--- /dev/null
+++ b/include/net/iptfs.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_IPTFS_H
+#define _NET_IPTFS_H
+
+#include <linux/types.h>
+#include <linux/ip.h>
+
+#define IPTFS_SUBTYPE_BASIC 0
+#define IPTFS_SUBTYPE_CC 1
+#define IPTFS_SUBTYPE_LAST IPTFS_SUBTYPE_CC
+
+#define IPTFS_CC_FLAGS_ECN_CE 0x1
+#define IPTFS_CC_FLAGS_PLMTUD 0x2
+
+extern void xfrm_iptfs_get_rtt_and_delays(struct ip_iptfs_cc_hdr *cch, u32 *rtt,
+					  u32 *actual_delay, u32 *xmit_delay);
+
+#endif /* _NET_IPTFS_H */
diff --git a/net/xfrm/trace_iptfs.h b/net/xfrm/trace_iptfs.h
new file mode 100644
index 000000000000..bade955942e3
--- /dev/null
+++ b/net/xfrm/trace_iptfs.h
@@ -0,0 +1,226 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* xfrm_trace_iptfs.h
+ *
+ * August 12 2023, Christian Hopps <chopps@labn.net>
+ *
+ * Copyright (c) 2023, LabN Consulting, L.L.C.
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM iptfs
+
+#if !defined(_TRACE_IPTFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_IPTFS_H
+
+#include <linux/kernel.h>
+#include <linux/skbuff.h>
+#include <linux/tracepoint.h>
+#include <net/ip.h>
+
+struct xfrm_iptfs_data;
+
+TRACE_EVENT(iptfs_egress_recv,
+	    TP_PROTO(struct sk_buff *skb, struct xfrm_iptfs_data *xtfs, u16 blkoff),
+	    TP_ARGS(skb, xtfs, blkoff),
+	    TP_STRUCT__entry(
+		    __field(struct sk_buff *, skb)
+		    __field(void *, head)
+		    __field(void *, head_pg_addr)
+		    __field(void *, pg0addr)
+		    __field(u32, skb_len)
+		    __field(u32, data_len)
+		    __field(u32, headroom)
+		    __field(u32, tailroom)
+		    __field(u32, tail)
+		    __field(u32, end)
+		    __field(u32, pg0off)
+		    __field(u8, head_frag)
+		    __field(u8, frag_list)
+		    __field(u8, nr_frags)
+		    __field(u16, blkoff)),
+	    TP_fast_assign(
+		    __entry->skb = skb;
+		    __entry->head = skb->head;
+		    __entry->skb_len = skb->len;
+		    __entry->data_len = skb->data_len;
+		    __entry->headroom = skb_headroom(skb);
+		    __entry->tailroom = skb_tailroom(skb);
+		    __entry->tail = skb->tail;
+		    __entry->end = skb->end;
+		    __entry->head_frag = skb->head_frag;
+		    __entry->frag_list = (bool)skb_shinfo(skb)->frag_list;
+		    __entry->nr_frags = skb_shinfo(skb)->nr_frags;
+		    __entry->blkoff = blkoff;
+		    __entry->head_pg_addr = page_address(virt_to_head_page(skb->head));
+		    __entry->pg0addr = (__entry->nr_frags ? page_address(skb_shinfo(skb)->frags[0].bv_page) : 0);
+		    __entry->pg0off = (__entry->nr_frags ? skb_shinfo(skb)->frags[0].bv_offset : 0);
+
+		    ),
+	    TP_printk("EGRESS: skb=%p len=%u data_len=%u headroom=%u head_frag=%u frag_list=%u nr_frags=%u blkoff=%u\n\t\ttailroom=%u tail=%u end=%u head=%p hdpgaddr=%p pg0->addr=%p pg0->data=%p pg0->off=%u",
+		      __entry->skb, __entry->skb_len, __entry->data_len, __entry->headroom,
+		      __entry->head_frag, __entry->frag_list, __entry->nr_frags,
+		      __entry->blkoff, __entry->tailroom, __entry->tail, __entry->end, __entry->head, __entry->head_pg_addr, __entry->pg0addr, __entry->pg0addr + __entry->pg0off, __entry->pg0off)
+	)
+
+
+DECLARE_EVENT_CLASS(iptfs_ingress_preq_event,
+		    TP_PROTO(struct sk_buff *skb, struct xfrm_iptfs_data *xtfs, u32 pmtu, u8 was_gso),
+		    TP_ARGS(skb, xtfs, pmtu, was_gso),
+		    TP_STRUCT__entry(
+			__field(struct sk_buff *, skb)
+			__field(u32, skb_len)
+			__field(u32, data_len)
+			__field(u32, pmtu)
+			__field(u32, queue_size)
+			__field(u32, proto_seq)
+			__field(u8, proto)
+		        __field(u8, was_gso)
+			    ),
+		    TP_fast_assign(
+			    __entry->skb = skb;
+			    __entry->skb_len = skb->len;
+			    __entry->data_len = skb->data_len;
+			    __entry->queue_size = xtfs->cfg.max_queue_size - xtfs->queue_size;
+			    __entry->proto = __trace_ip_proto(ip_hdr(skb));
+			    __entry->proto_seq = __trace_ip_proto_seq(ip_hdr(skb));
+			    __entry->pmtu = pmtu;
+			__entry->was_gso = was_gso;
+			    ),
+		    TP_printk("INGRPREQ: skb=%p len=%u data_len=%u qsize=%u proto=%u proto_seq=%u pmtu=%u was_gso=%u",
+			      __entry->skb, __entry->skb_len, __entry->data_len, __entry->queue_size,
+			      __entry->proto, __entry->proto_seq, __entry->pmtu, __entry->was_gso));
+
+DEFINE_EVENT(iptfs_ingress_preq_event, iptfs_enqueue,
+	     TP_PROTO(struct sk_buff *skb, struct xfrm_iptfs_data *xtfs, u32 pmtu, u8 was_gso),
+	     TP_ARGS(skb, xtfs, pmtu, was_gso));
+
+DEFINE_EVENT(iptfs_ingress_preq_event, iptfs_no_queue_space,
+	     TP_PROTO(struct sk_buff *skb, struct xfrm_iptfs_data *xtfs, u32 pmtu, u8 was_gso),
+	     TP_ARGS(skb, xtfs, pmtu, was_gso));
+
+DEFINE_EVENT(iptfs_ingress_preq_event, iptfs_too_big,
+	     TP_PROTO(struct sk_buff *skb, struct xfrm_iptfs_data *xtfs, u32 pmtu, u8 was_gso),
+	     TP_ARGS(skb, xtfs, pmtu, was_gso));
+
+DECLARE_EVENT_CLASS(
+	iptfs_ingress_postq_event,
+	TP_PROTO(struct sk_buff *skb, u32 mtu, u16 blkoff, struct iphdr *iph),
+	TP_ARGS(skb, mtu, blkoff, iph),
+	TP_STRUCT__entry(__field(struct sk_buff *, skb)
+			 __field(u32, skb_len)
+			 __field(u32, data_len)
+			 __field(u32, mtu)
+			 __field(u32, proto_seq)
+			 __field(u16, blkoff)
+			 __field(u8, proto)),
+	TP_fast_assign(__entry->skb = skb;
+		       __entry->skb_len = skb->len;
+		       __entry->data_len = skb->data_len;
+		       __entry->mtu = mtu;
+		       __entry->blkoff = blkoff;
+		       __entry->proto = iph ? __trace_ip_proto(iph) : 0;
+		       __entry->proto_seq = iph ? __trace_ip_proto_seq(iph) : 0;
+		),
+	TP_printk(
+		"INGRPSTQ: skb=%p len=%u data_len=%u mtu=%u blkoff=%u proto=%u proto_seq=%u",
+		__entry->skb, __entry->skb_len, __entry->data_len, __entry->mtu,
+		__entry->blkoff, __entry->proto, __entry->proto_seq));
+
+DEFINE_EVENT(iptfs_ingress_postq_event, iptfs_first_dequeue,
+	     TP_PROTO(struct sk_buff *skb, u32 mtu, u16 blkoff,
+		      struct iphdr *iph),
+	     TP_ARGS(skb, mtu, blkoff, iph));
+
+DEFINE_EVENT(iptfs_ingress_postq_event, iptfs_first_fragmenting,
+	     TP_PROTO(struct sk_buff *skb, u32 mtu, u16 blkoff,
+		      struct iphdr *iph),
+	     TP_ARGS(skb, mtu, blkoff, iph));
+
+DEFINE_EVENT(iptfs_ingress_postq_event, iptfs_first_final_fragment,
+	     TP_PROTO(struct sk_buff *skb, u32 mtu, u16 blkoff,
+		      struct iphdr *iph),
+	     TP_ARGS(skb, mtu, blkoff, iph));
+
+DEFINE_EVENT(iptfs_ingress_postq_event, iptfs_first_toobig,
+	     TP_PROTO(struct sk_buff *skb, u32 mtu, u16 blkoff,
+		      struct iphdr *iph),
+	     TP_ARGS(skb, mtu, blkoff, iph));
+
+TRACE_EVENT(iptfs_ingress_nth_peek,
+	    TP_PROTO(struct sk_buff *skb, u32 remaining),
+	    TP_ARGS(skb, remaining),
+	    TP_STRUCT__entry(__field(struct sk_buff *, skb)
+			     __field(u32, skb_len)
+			     __field(u32, remaining)),
+	    TP_fast_assign(__entry->skb = skb;
+			   __entry->skb_len = skb->len;
+			   __entry->remaining = remaining;
+		    ),
+	    TP_printk("INGRPSTQ: NTHPEEK: skb=%p len=%u remaining=%u", __entry->skb,
+		      __entry->skb_len, __entry->remaining));
+
+TRACE_EVENT(iptfs_ingress_nth_add, TP_PROTO(struct sk_buff *skb, u8 share_ok),
+	    TP_ARGS(skb, share_ok),
+	    TP_STRUCT__entry(__field(struct sk_buff *, skb)
+			     __field(u32, skb_len)
+			     __field(u32, data_len)
+			     __field(u8, share_ok)
+			     __field(u8, head_frag)
+			     __field(u8, pp_recycle)
+			     __field(u8, cloned)
+			     __field(u8, shared)
+			     __field(u8, nr_frags)
+			     __field(u8, frag_list)
+		    ),
+	    TP_fast_assign(__entry->skb = skb;
+			   __entry->skb_len = skb->len;
+			   __entry->data_len = skb->data_len;
+			   __entry->share_ok = share_ok;
+			   __entry->head_frag = skb->head_frag;
+			   __entry->pp_recycle = skb->pp_recycle;
+			   __entry->cloned = skb_cloned(skb);
+			   __entry->shared = skb_shared(skb);
+			   __entry->nr_frags = skb_shinfo(skb)->nr_frags;
+			   __entry->frag_list = (bool)skb_shinfo(skb)->frag_list;
+		    ),
+	    TP_printk("INGRPSTQ: NTHADD: skb=%p len=%u data_len=%u share_ok=%u head_frag=%u pp_recycle=%u cloned=%u shared=%u nr_frags=%u frag_list=%u",
+		      __entry->skb,
+		      __entry->skb_len,
+		      __entry->data_len,
+		      __entry->share_ok,
+		      __entry->head_frag,
+		      __entry->pp_recycle,
+		      __entry->cloned,
+		      __entry->shared,
+		      __entry->nr_frags,
+		      __entry->frag_list
+		    ));
+
+DECLARE_EVENT_CLASS(iptfs_timer_event,
+		    TP_PROTO(struct xfrm_iptfs_data *xtfs, u64 time_val),
+		    TP_ARGS(xtfs, time_val),
+		    TP_STRUCT__entry(
+			__field(u64, time_val)
+			__field(u64, set_time)),
+		    TP_fast_assign(
+			__entry->time_val = time_val;
+			__entry->set_time = xtfs->iptfs_settime;
+		    ),
+		    TP_printk("TIMER: set_time=%llu time_val=%llu", __entry->set_time, __entry->time_val));
+
+DEFINE_EVENT(iptfs_timer_event, iptfs_timer_start,
+	     TP_PROTO(struct xfrm_iptfs_data *xtfs, u64 time_val),
+	     TP_ARGS(xtfs, time_val));
+
+DEFINE_EVENT(iptfs_timer_event, iptfs_timer_expire,
+	     TP_PROTO(struct xfrm_iptfs_data *xtfs, u64 time_val),
+	     TP_ARGS(xtfs, time_val));
+
+#endif /* _TRACE_IPTFS_H */
+
+/* This part must be outside protection */
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH ../../net/xfrm
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace_iptfs
+#include <trace/define_trace.h>
diff --git a/net/xfrm/xfrm_iptfs.c b/net/xfrm/xfrm_iptfs.c
new file mode 100644
index 000000000000..61d460dfb01a
--- /dev/null
+++ b/net/xfrm/xfrm_iptfs.c
@@ -0,0 +1,2735 @@
+// SPDX-License-Identifier: GPL-2.0
+/* xfrm_iptfs: IPTFS encapsulation support
+ *
+ * April 21 2022, Christian Hopps <chopps@labn.net>
+ *
+ * Copyright (c) 2022, LabN Consulting, L.L.C.
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/icmpv6.h>
+#include <net/gro.h>
+#include <net/icmp.h>
+#include <net/ip6_route.h>
+#include <net/inet_ecn.h>
+#include <net/iptfs.h>
+#include <net/xfrm.h>
+
+#include <crypto/aead.h>
+
+#include "xfrm_inout.h"
+#include "trace_iptfs.h"
+
+/* 1) skb->head should be cache aligned.
+ * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
+ * start -16 from data.
+ * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
+ * we want data to be cache line aligned so all the pushed headers will be in
+ * another cacheline.
+ */
+#define XFRM_IPTFS_MIN_L3HEADROOM 128
+#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
+#define IPTFS_FRAG_COPY_MAX 256	/* max for copying to create iptfs frags */
+#define IPTFS_PKT_SHARE_MIN 129	/* min to try to share vs copy pkt data */
+#define NSECS_IN_USEC 1000
+
+#define IPTFS_TYPE_NOCC 0
+#define IPTFS_TYPE_CC 1
+
+#define IPTFS_HRTIMER_MODE HRTIMER_MODE_REL_SOFT
+
+struct skb_wseq {
+	struct sk_buff *skb;
+	u64 drop_time;
+};
+
+struct xfrm_iptfs_config {
+	bool dont_frag : 1;
+	u16 reorder_win_size;
+	u32 pkt_size;	    /* outer_packet_size or 0 */
+	u32 max_queue_size; /* octets */
+	u64 init_delay_us;  /* microseconds */
+	u32 drop_time_us;   /* microseconds */
+};
+
+struct xfrm_iptfs_data {
+	struct xfrm_iptfs_config cfg;
+
+	/* Ingress User Input */
+	struct xfrm_state *x;	    /* owning state */
+	struct sk_buff_head queue;  /* output queue */
+	u32 queue_size;		    /* octets */
+	u32 ecn_queue_size;	    /* octets above which ECN mark */
+	u64 init_delay_ns;	    /* nanoseconds */
+	struct hrtimer iptfs_timer; /* output timer */
+	time64_t iptfs_settime;	    /* time timer was set */
+	u32 payload_mtu;	    /* max payload size */
+
+	/* Tunnel input reordering */
+	bool w_seq_set;		  /* true after first seq received */
+	u64 w_wantseq;		  /* expected next sequence */
+	struct skb_wseq *w_saved; /* the saved buf array */
+	u32 w_savedlen;		  /* the saved len (not size) */
+	spinlock_t drop_lock;
+	struct hrtimer drop_timer;
+	u64 drop_time_ns;
+
+	/* Tunnel input reassembly */
+	struct sk_buff *ra_newskb; /* new pkt being reassembled */
+	u64 ra_wantseq;		   /* expected next sequence */
+	u8 ra_runt[6];		   /* last pkt bytes from last skb */
+	u8 ra_runtlen;		   /* count of ra_runt */
+};
+
+static u32 __iptfs_get_inner_mtu(struct xfrm_state *x, int outer_mtu);
+static enum hrtimer_restart iptfs_delay_timer(struct hrtimer *me);
+static enum hrtimer_restart iptfs_drop_timer(struct hrtimer *me);
+
+/* ================= */
+/* Utility Functions */
+/* ================= */
+
+static inline u32 __trace_ip_proto(struct iphdr *iph)
+{
+	if (iph->version == 4)
+		return iph->protocol;
+	return ((struct ipv6hdr *)iph)->nexthdr;
+}
+
+static u32 __trace_ip_proto_seq(struct iphdr *iph)
+{
+	void *nexthdr;
+	u32 protocol = 0;
+
+	if (iph->version == 4) {
+		nexthdr = (void *)(iph + 1);
+		protocol = iph->protocol;
+	} else if (iph->version == 6) {
+		nexthdr = (void *)(((struct ipv6hdr *)(iph)) + 1);
+		protocol = ((struct ipv6hdr *)(iph))->nexthdr;
+	}
+	switch (protocol) {
+	case IPPROTO_ICMP:
+		return ntohs(((struct icmphdr *)nexthdr)->un.echo.sequence);
+	case IPPROTO_ICMPV6:
+		return ntohs(((struct icmp6hdr *)nexthdr)->icmp6_sequence);
+	case IPPROTO_TCP:
+		return ntohl(((struct tcphdr *)nexthdr)->seq);
+	case IPPROTO_UDP:
+		return ntohs(((struct udphdr *)nexthdr)->source);
+	default:
+		return 0;
+	}
+}
+
+static inline u64 __esp_seq(struct sk_buff *skb)
+{
+	u64 seq = ntohl(XFRM_SKB_CB(skb)->seq.input.low);
+
+	return seq | (u64)ntohl(XFRM_SKB_CB(skb)->seq.input.hi) << 32;
+}
+
+/* ================= */
+/* SK_BUFF Functions */
+/* ================= */
+
+/**
+ * iptfs_alloc_skb() - Allocate a new `skb` using a meta-data template.
+ * @tpl: the template to copy the new `skb`s meta-data from.
+ * @len: the linear length of the head data, zero is fine.
+ * @l3resv: true if reserve needs to support pushing L3 headers
+ *
+ * A new `skb` is allocated and it's meta-data is initialized from `tpl`, the
+ * head data is sized to `len` + reserved space set according to the @l3resv
+ * boolean. When @l3resv is false, resv is XFRM_IPTFS_MIN_L2HEADROOM which
+ * arranges for `skb->data - 16` (etherhdr space) to be the start of a cacheline.
+ * Otherwise, @l3resv is true and resv is either the size of headroom from `tpl` or
+ * XFRM_IPTFS_MIN_L3HEADROOM whichever is greater, which tries to align
+ * skb->data to a cacheline as all headers will be pushed on the previous
+ * cacheline bytes.
+ *
+ * When copying meta-data from the @tpl, the sk_buff->headers are not copied.
+ *
+ * Zero length skbs are allocated when we only need a head skb to hold new
+ * packet headers (basically the mac header) that sit on top of existing shared
+ * packet data.
+ *
+ * Return: the new skb or NULL.
+ */
+static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 len,
+				       bool l3resv)
+{
+	struct sk_buff *skb;
+	u32 resv;
+
+	if (!l3resv) {
+		resv = XFRM_IPTFS_MIN_L2HEADROOM;
+	} else {
+		resv = skb_headroom(tpl);
+		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
+			resv = XFRM_IPTFS_MIN_L3HEADROOM;
+	}
+
+	skb = alloc_skb(len + resv, GFP_ATOMIC);
+	if (!skb) {
+		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMINERROR);
+		return NULL;
+	}
+
+	skb_reserve(skb, resv);
+
+	/* Code from __copy_skb_header() -- we do not want any of the
+	 * tpl->headers copied over, so we aren't using `skb_copy_header()`.
+	 */
+	skb->tstamp = tpl->tstamp;
+	skb->dev = tpl->dev;
+	memcpy(skb->cb, tpl->cb, sizeof(skb->cb));
+	skb_dst_copy(skb, tpl);
+	__skb_ext_copy(skb, tpl);
+	__nf_copy(skb, tpl, false);
+
+	return skb;
+}
+
+/**
+ * skb_head_to_frag() - initialize a skb_frag_t based on skb head data
+ */
+static void skb_head_to_frag(const struct sk_buff *skb, skb_frag_t *frag)
+{
+	struct page *page = virt_to_head_page(skb->data);
+	unsigned char *addr = (unsigned char *)page_address(page);
+
+	BUG_ON(!skb->head_frag);
+	skb_frag_fill_page_desc(frag, page, skb->data - addr, skb_headlen(skb));
+}
+
+/**
+ * struct skb_frag_walk - use to track a walk through fragments
+ * @fragi: current fragment index
+ * @past: length of data in fragments before @fragi
+ * @total: length of data in all fragments
+ * @nr_frags: number of fragments present in array
+ * @initial_offset: the value passed in to skb_prepare_frag_walk()
+ * @pp_recycle: copy of skb->pp_recycle
+ * @frags: the page fragments inc. room for head page
+ */
+struct skb_frag_walk {
+	u32 fragi;
+	u32 past;
+	u32 total;
+	u32 nr_frags;
+	u32 initial_offset;
+	bool pp_recycle;
+	skb_frag_t frags[MAX_SKB_FRAGS + 1];
+};
+
+/**
+ * skb_prepare_frag_walk() - initialize a frag walk over an skb.
+ * @skb: the skb to walk.
+ * @initial_offset: start the walk @initial_offset into the skb.
+ * @walk: the walk to initialize
+ *
+ * Future calls to skb_add_frags() will expect the @offset value to be at
+ * least @initial_offset large.
+ */
+static void skb_prepare_frag_walk(struct sk_buff *skb, u32 initial_offset,
+				  struct skb_frag_walk *walk)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	skb_frag_t *frag, *from;
+	u32 i;
+
+	walk->initial_offset = initial_offset;
+	walk->fragi = 0;
+	walk->past = 0;
+	walk->total = 0;
+	walk->nr_frags = 0;
+	walk->pp_recycle = skb->pp_recycle;
+
+	if (skb->head_frag) {
+		if (initial_offset >= skb_headlen(skb)) {
+			initial_offset -= skb_headlen(skb);
+		} else {
+			frag = &walk->frags[walk->nr_frags++];
+			skb_head_to_frag(skb, frag);
+			frag->bv_offset += initial_offset;
+			frag->bv_len -= initial_offset;
+			walk->total += frag->bv_len;
+			initial_offset = 0;
+		}
+	} else {
+		BUG_ON(skb_headlen(skb) > initial_offset);
+		initial_offset -= skb_headlen(skb);
+	}
+
+	for (i = 0; i < shinfo->nr_frags; i++) {
+		from = &shinfo->frags[i];
+		if (initial_offset >= from->bv_len) {
+			initial_offset -= from->bv_len;
+			continue;
+		}
+		frag = &walk->frags[walk->nr_frags++];
+		*frag = *from;
+		if (initial_offset) {
+			frag->bv_offset += initial_offset;
+			frag->bv_len -= initial_offset;
+			initial_offset = 0;
+		}
+		walk->total += frag->bv_len;
+	}
+	BUG_ON(initial_offset != 0);
+}
+
+static u32 __skb_reset_frag_walk(struct skb_frag_walk *walk, u32 offset)
+{
+	/* Adjust offset to refer to internal walk values */
+	BUG_ON(offset < walk->initial_offset);
+	offset -= walk->initial_offset;
+
+	/* Get to the correct fragment for offset */
+	while (offset < walk->past) {
+		walk->past -= walk->frags[--walk->fragi].bv_len;
+		if (offset >= walk->past)
+			break;
+		BUG_ON(walk->fragi == 0);
+	}
+	while (offset >= walk->past + walk->frags[walk->fragi].bv_len)
+		walk->past += walk->frags[walk->fragi++].bv_len;
+
+	/* offset now relative to this current frag */
+	offset -= walk->past;
+	return offset;
+}
+
+/**
+ * skb_can_add_frags() - check if ok to add frags from walk to skb
+ * @skb: skb to check for adding frags to
+ * @walk: the walk that will be used as source for frags.
+ * @offset: offset from beginning of original skb to start from.
+ * @len: amount of data to add frag references to in @skb.
+ */
+static bool skb_can_add_frags(const struct sk_buff *skb,
+			      struct skb_frag_walk *walk, u32 offset, u32 len)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	u32 fragi, nr_frags, fraglen;
+
+	if (skb_has_frag_list(skb) || skb->pp_recycle != walk->pp_recycle)
+		return false;
+
+	/* Make offset relative to current frag after setting that */
+	offset = __skb_reset_frag_walk(walk, offset);
+
+	/* Verify we have array space for the fragments we need to add */
+	fragi = walk->fragi;
+	nr_frags  = shinfo->nr_frags;
+	while (len && fragi < walk->nr_frags) {
+		skb_frag_t *frag = &walk->frags[fragi];
+
+		fraglen = frag->bv_len;
+		if (offset) {
+			fraglen -= offset;
+			offset = 0;
+		}
+		if (++nr_frags > MAX_SKB_FRAGS)
+			return false;
+		if (len <= fraglen)
+			return true;
+		len -= fraglen;
+		fragi++;
+	}
+	/* We may not copy all @len but what we have will fit. */
+	return true;
+}
+
+/**
+ * skb_add_frags() - add a range of fragment references into an skb
+ * @skb: skb to add references into
+ * @walk: the walk to add referenced fragments from.
+ * @offset: offset from beginning of original skb to start from.
+ * @len: amount of data to add frag references to in @skb.
+ *
+ * skb_can_add_frags() should be called before this function to verify that the
+ * destination @skb is compatible with the walk and has space in the array for
+ * the to be added frag refrences.
+ *
+ * Return: The number of bytes not added to @skb b/c we reached the end of the
+ * walk before adding all of @len.
+ */
+static int skb_add_frags(struct sk_buff *skb, struct skb_frag_walk *walk,
+			 u32 offset, u32 len)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	u32 fraglen;
+
+	BUG_ON(skb->pp_recycle != walk->pp_recycle);
+	if (!walk->nr_frags || offset >= walk->total + walk->initial_offset)
+		return len;
+
+	/* make offset relative to current frag after setting that */
+	offset = __skb_reset_frag_walk(walk, offset);
+	BUG_ON(shinfo->nr_frags >= MAX_SKB_FRAGS);
+
+	while (len && walk->fragi < walk->nr_frags) {
+		skb_frag_t *frag = &walk->frags[walk->fragi];
+		skb_frag_t *tofrag = &shinfo->frags[shinfo->nr_frags];
+
+		*tofrag = *frag;
+		if (offset) {
+			tofrag->bv_offset += offset;
+			tofrag->bv_len -= offset;
+			offset = 0;
+		}
+		__skb_frag_ref(tofrag);
+		shinfo->nr_frags++;
+		BUG_ON(shinfo->nr_frags > MAX_SKB_FRAGS);
+
+		/* see if we are done */
+		fraglen = tofrag->bv_len;
+		if (len < fraglen) {
+			tofrag->bv_len = len;
+			skb->len += len;
+			skb->data_len += len;
+			return 0;
+		}
+		/* advance to next source fragment */
+		len -= fraglen;		  /* careful, use dst bv_len */
+		skb->len += fraglen;	  /* careful, "   "    "     */
+		skb->data_len += fraglen; /* careful, "   "    "     */
+		walk->past +=
+			frag->bv_len; /* careful, use src bv_len */
+		walk->fragi++;
+	}
+	return len;
+}
+
+/**
+ * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
+ * @st: source skb_seq_state
+ * @offset: offset in source
+ * @to: destination buffer
+ * @len: number of bytes to copy
+ *
+ * Copy @len bytes from @offset bytes into the source @st to the destination
+ * buffer @to. `offset` should increase (or be unchanged) with each subsequent
+ * call to this function. If offset needs to decrease from the previous use `st`
+ * should be reset first.
+ */
+static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to, int len)
+{
+	const u8 *data;
+	u32 sqlen;
+
+	for (;;) {
+		sqlen = skb_seq_read(offset, &data, st);
+		if (sqlen == 0)
+			return -ENOMEM;
+		if (sqlen >= len) {
+			memcpy(to, data, len);
+			return 0;
+		}
+		memcpy(to, data, sqlen);
+		to += sqlen;
+		offset += sqlen;
+		len -= sqlen;
+	}
+}
+
+/* ================================== */
+/* IPTFS Trace Event Definitions      */
+/* ================================== */
+
+#define CREATE_TRACE_POINTS
+#include "trace_iptfs.h"
+
+/* ================================== */
+/* IPTFS Receiving (egress) Functions */
+/* ================================== */
+
+/**
+ * iptfs_pskb_add_frags() - Create and add frags into a new sk_buff.
+ * @tpl: template to create new skb from.
+ * @walk: The source for fragments to add.
+ * @off: The offset into @walk to add frags from, also used with @st and
+ *       @copy_len.
+ * @len: The length of data to add covering frags from @walk into @skb.
+ *       This must be <= @skblen.
+ * @st: The sequence state to copy from into the new head skb.
+ * @copy_len: Copy @copy_len bytes from @st at offset @off into the new skb
+ *            linear space.
+ *
+ * Create a new sk_buff `skb` using the template @tpl. Copy @copy_len bytes from
+ * @st into the new skb linear space, and then add shared fragments from the
+ * frag walk for the remaining @len of data (i.e., @len - @copy_len bytes).
+ *
+ * Return: The newly allocated sk_buff `skb` or NULL if an error occurs.
+ */
+struct sk_buff *iptfs_pskb_add_frags(struct sk_buff *tpl,
+				     struct skb_frag_walk *walk, u32 off,
+				     u32 len, struct skb_seq_state *st,
+				     u32 copy_len)
+{
+	struct sk_buff *skb;
+
+	skb = iptfs_alloc_skb(tpl, copy_len, false);
+	if (!skb)
+		return NULL;
+
+	/* this should not normally be happening */
+	if (!skb_can_add_frags(skb, walk, off + copy_len, len - copy_len)) {
+		kfree_skb(skb);
+		return NULL;
+	}
+
+	if (copy_len &&
+	    skb_copy_bits_seq(st, off, skb_put(skb, copy_len), copy_len)) {
+		XFRM_INC_STATS(dev_net(st->root_skb->dev),
+			       LINUX_MIB_XFRMINERROR);
+		kfree_skb(skb);
+		return NULL;
+	}
+
+	skb_add_frags(skb, walk, off + copy_len, len - copy_len);
+	return skb;
+}
+
+/**
+ * iptfs_pskb_extract_seq() - Create and load data into a new sk_buff.
+ * @skblen: the total data size for `skb`.
+ * @st: The source for the rest of the data to copy into `skb`.
+ * @off: The offset into @st to copy data from.
+ * @len: The length of data to copy from @st into `skb`. This must be <=
+ *       @skblen.
+ *
+ * Create a new sk_buff `skb` with @skblen of packet data space. If non-zero,
+ * copy @rlen bytes of @runt into `skb`. Then using seq functions copy @len
+ * bytes from @st into `skb` starting from @off.
+ *
+ * It is an error for @len to be greater than the amount of data left in @st.
+ *
+ * Return: The newly allocated sk_buff `skb` or NULL if an error occurs.
+ */
+static struct sk_buff *
+iptfs_pskb_extract_seq(u32 skblen, struct skb_seq_state *st, u32 off, int len)
+{
+	struct sk_buff *skb = iptfs_alloc_skb(st->root_skb, skblen, false);
+
+	if (!skb)
+		return NULL;
+	if (skb_copy_bits_seq(st, off, skb_put(skb, len), len)) {
+		XFRM_INC_STATS(dev_net(st->root_skb->dev),
+			       LINUX_MIB_XFRMINERROR);
+		kfree_skb(skb);
+		return NULL;
+	}
+	return skb;
+}
+
+/**
+ * iptfs_input_save_runt() - save data in xtfs runt space.
+ *
+ * Save the small (`len`) start of a fragmented packet in `buf` in the xtfs data
+ * runt space.
+ */
+static inline void iptfs_input_save_runt(struct xfrm_iptfs_data *xtfs, u64 seq,
+					 u8 *buf, int len)
+{
+	BUG_ON(xtfs->ra_newskb); /* we won't have a new SKB yet */
+
+	memcpy(xtfs->ra_runt, buf, len);
+
+	xtfs->ra_runtlen = len;
+	xtfs->ra_wantseq = seq + 1;
+}
+
+/**
+ * __iptfs_iphlen() - return the v4/v6 header length using packet data.
+ *
+ * The version data is expected to be valid (i.e., either 4 or 6).
+ */
+static u32 __iptfs_iphlen(u8 *data)
+{
+	struct iphdr *iph = (struct iphdr *)data;
+
+	if (iph->version == 0x4)
+		return sizeof(*iph);
+	BUG_ON(iph->version != 0x6);
+	return sizeof(struct ipv6hdr);
+}
+
+/**
+ * __iptfs_iplen() - return the v4/v6 length using packet data.
+ *
+ * Grab the IPv4 or IPv6 length value in the start of the inner packet header
+ * pointed to by `data`. Assumes data len is enough for the length field only.
+ *
+ * The version data is expected to be valid (i.e., either 4 or 6).
+ */
+static u32 __iptfs_iplen(u8 *data)
+{
+	struct iphdr *iph = (struct iphdr *)data;
+
+	if (iph->version == 0x4)
+		return ntohs(iph->tot_len);
+	BUG_ON(iph->version != 0x6);
+	return ntohs(((struct ipv6hdr *)iph)->payload_len) +
+	       sizeof(struct ipv6hdr);
+}
+
+/**
+ * iptfs_complete_inner_skb() - finish preparing the inner packet for gro recv.
+ *
+ * Finish the standard xfrm processing on the inner packet prior to sending back
+ * through gro_cells_receive. We do this separately b/c we are building a list
+ * of packets in the hopes that one day a list will be taken by
+ * xfrm_input.
+ */
+static void iptfs_complete_inner_skb(struct xfrm_state *x, struct sk_buff *skb)
+{
+	skb_reset_network_header(skb);
+
+	/* The packet is going back through gro_cells_receive no need to
+	 * set this.
+	 */
+	skb_reset_transport_header(skb);
+
+	/* Packet already has checksum value set. */
+	skb->ip_summed = CHECKSUM_NONE;
+
+	/* Our skb will contain the header data copied when this outer packet
+	 * which contained the start of this inner packet. This is true
+	 * when we allocate a new skb as well as when we reuse the existing skb.
+	 */
+	if (ip_hdr(skb)->version == 0x4) {
+		struct iphdr *iph = ip_hdr(skb);
+
+		if (x->props.flags & XFRM_STATE_DECAP_DSCP)
+			ipv4_copy_dscp(XFRM_MODE_SKB_CB(skb)->tos, iph);
+		if (!(x->props.flags & XFRM_STATE_NOECN))
+			if (INET_ECN_is_ce(XFRM_MODE_SKB_CB(skb)->tos))
+				IP_ECN_set_ce(iph);
+
+		skb->protocol = htons(ETH_P_IP);
+	} else {
+		struct ipv6hdr *iph = ipv6_hdr(skb);
+
+		if (x->props.flags & XFRM_STATE_DECAP_DSCP)
+			ipv6_copy_dscp(XFRM_MODE_SKB_CB(skb)->tos, iph);
+		if (!(x->props.flags & XFRM_STATE_NOECN))
+			if (INET_ECN_is_ce(XFRM_MODE_SKB_CB(skb)->tos))
+				IP6_ECN_set_ce(skb, iph);
+
+		skb->protocol = htons(ETH_P_IPV6);
+	}
+}
+
+static inline void __iptfs_reassem_done(struct xfrm_iptfs_data *xtfs, bool free)
+{
+	int ret;
+
+	assert_spin_locked(&xtfs->drop_lock);
+
+	/* We don't care if it works locking takes care of things */
+	ret = hrtimer_try_to_cancel(&xtfs->drop_timer);
+	if (free)
+		kfree_skb(xtfs->ra_newskb);
+	xtfs->ra_newskb = NULL;
+}
+
+/**
+ * iptfs_reassem_done() - In-progress packet is aborted free the state.
+ */
+static inline void iptfs_reassem_abort(struct xfrm_iptfs_data *xtfs)
+{
+	__iptfs_reassem_done(xtfs, true);
+}
+
+/**
+ * iptfs_reassem_done() - In-progress packet is complete, clear the state.
+ */
+static inline void iptfs_reassem_done(struct xfrm_iptfs_data *xtfs)
+{
+	__iptfs_reassem_done(xtfs, false);
+}
+
+/**
+ * iptfs_reassem_cont() - Continue the reassembly of an inner packets.
+ *
+ * Process an IPTFS payload that has a non-zero `blkoff` or when we are
+ * expecting the continuation b/c we have a runt or in-progress packet.
+ */
+static u32 iptfs_reassem_cont(struct xfrm_iptfs_data *xtfs, u64 seq,
+			      struct skb_seq_state *st, struct sk_buff *skb,
+			      u32 data, u32 blkoff, struct list_head *list)
+{
+	struct skb_frag_walk _fragwalk;
+	struct skb_frag_walk *fragwalk = NULL;
+	struct sk_buff *newskb = xtfs->ra_newskb;
+	u32 remaining = skb->len - data;
+	u32 runtlen = xtfs->ra_runtlen;
+	u32 copylen, fraglen, ipremain, iphlen, iphremain, rrem;
+
+	/* Handle packet fragment we aren't expecting */
+	if (!runtlen && !xtfs->ra_newskb)
+		return data + min(blkoff, remaining);
+
+	/* Important to remember that input to this function is an ordered
+	 * packet stream (unless the user disabled the reorder window). Thus if
+	 * we are waiting for, and expecting the next packet so we can continue
+	 * assembly, a newer sequence number indicates older ones are not coming
+	 * (or if they do should be ignored). Technically we can receive older
+	 * ones when the reorder window is disabled; however, the user should
+	 * have disabled fragmentation in this case, and regardless we don't
+	 * deal with it.
+	 *
+	 * blkoff could be zero if the stream is messed up (or it's an all pad
+	 * insertion) be careful to handle that case in each of the below
+	 */
+
+	/* Too old case: This can happen when the reorder window is disabled so
+	 * ordering isn't actually guaranteed.
+	 */
+	if (seq < xtfs->ra_wantseq)
+		return data + remaining;
+
+	/* Too new case: We missed what we wanted cleanup. */
+	if (seq > xtfs->ra_wantseq) {
+		XFRM_INC_STATS(dev_net(skb->dev),
+			       LINUX_MIB_XFRMINIPTFSERROR);
+		goto abandon;
+	}
+
+	if (blkoff == 0) {
+		if ((*skb->data & 0xF0) != 0) {
+			XFRM_INC_STATS(dev_net(skb->dev),
+				       LINUX_MIB_XFRMINIPTFSERROR);
+			goto abandon;
+		}
+		/* Handle all pad case, advance expected sequence number.
+		 * (RFC 9347 S2.2.3)
+		 */
+		xtfs->ra_wantseq++;
+		/* will end parsing */
+		return data + remaining;
+	}
+
+	if (runtlen) {
+		BUG_ON(xtfs->ra_newskb);
+
+		/* Regardless of what happens we're done with the runt */
+		xtfs->ra_runtlen = 0;
+
+		/* The start of this inner packet was at the very end of the last
+		 * iptfs payload which didn't include enough for the ip header
+		 * length field. We must have *at least* that now.
+		 */
+		rrem = sizeof(xtfs->ra_runt) - runtlen;
+		if (remaining < rrem || blkoff < rrem) {
+			XFRM_INC_STATS(dev_net(skb->dev),
+				       LINUX_MIB_XFRMINIPTFSERROR);
+			goto abandon;
+		}
+
+		/* fill in the runt data */
+		if (skb_copy_bits_seq(st, data, &xtfs->ra_runt[runtlen],
+				      rrem)) {
+			XFRM_INC_STATS(dev_net(skb->dev),
+				       LINUX_MIB_XFRMINBUFFERERROR);
+			goto abandon;
+		}
+
+		/* We have enough data to get the ip length value now,
+		 * allocate an in progress skb
+		 */
+		ipremain = __iptfs_iplen(xtfs->ra_runt);
+		if (ipremain < sizeof(xtfs->ra_runt)) {
+			/* length has to be at least runtsize large */
+			XFRM_INC_STATS(dev_net(skb->dev),
+				       LINUX_MIB_XFRMINIPTFSERROR);
+			goto abandon;
+		}
+
+		/* For the runt case we don't attempt sharing currently. NOTE:
+		 * Currently, this IPTFS implementation will not create runts.
+		 */
+
+		newskb = iptfs_alloc_skb(skb, ipremain, false);
+		if (!newskb) {
+			XFRM_INC_STATS(dev_net(skb->dev),
+				       LINUX_MIB_XFRMINERROR);
+			goto abandon;
+		}
+		xtfs->ra_newskb = newskb;
+
+		/* Copy the runt data into the buffer, but leave data
+		 * pointers the same as normal non-runt case. The extra `rrem`
+		 * recopied bytes are basically cacheline free. Allows using
+		 * same logic below to complete.
+		 */
+		memcpy(skb_put(newskb, runtlen), xtfs->ra_runt,
+		       sizeof(xtfs->ra_runt));
+	}
+
+	/* Continue reassembling the packet */
+	ipremain = __iptfs_iplen(newskb->data);
+	iphlen = __iptfs_iphlen(newskb->data);
+
+	/* Sanity check, we created the newskb knowing the IP length so the IP
+	 * length can't now be shorter.
+	 */
+	BUG_ON(newskb->len > ipremain);
+
+	ipremain -= newskb->len;
+	if (blkoff < ipremain) {
+		/* Corrupt data, we don't have enough to complete the packet */
+		XFRM_INC_STATS(dev_net(skb->dev), LINUX_MIB_XFRMINIPTFSERROR);
+		goto abandon;
+	}
+
+	/* We want the IP header in linear space */
+	if (newskb->len < iphlen)  {
+		iphremain = iphlen - newskb->len;
+		if (blkoff < iphremain) {
+			XFRM_INC_STATS(dev_net(skb->dev),
+				       LINUX_MIB_XFRMINIPTFSERROR);
+			goto abandon;
+		}
+		fraglen = min(blkoff, remaining);
+		copylen = min(fraglen, iphremain);
+		BUG_ON(skb_tailroom(newskb) < copylen);
+		if (skb_copy_bits_seq(st, data, skb_put(newskb, copylen), copylen)) {
+			XFRM_INC_STATS(dev_net(skb->dev),
+				       LINUX_MIB_XFRMINBUFFERERROR);
+			goto abandon;
+		}
+		/* this is a silly condition that might occur anyway */
+		if (copylen < iphremain) {
+			xtfs->ra_wantseq++;
+			return data + fraglen;
+		}
+		/* update data and things derived from it */
+		data += copylen;
+		blkoff -= copylen;
+		remaining -= copylen;
+		ipremain -= copylen;
+	}
+
+	fraglen = min(blkoff, remaining);
+	copylen = min(fraglen, ipremain);
+
+	/* If we may have the opportunity to share prepare a fragwalk. */
+	if (!skb_has_frag_list(skb) && !skb_has_frag_list(newskb) &&
+	    (skb->head_frag || skb->len == skb->data_len) &&
+	    skb->pp_recycle == newskb->pp_recycle) {
+		fragwalk = &_fragwalk;
+		skb_prepare_frag_walk(skb, data, fragwalk);
+	}
+
+	/* Try share then copy. */
+	if (fragwalk && skb_can_add_frags(newskb, fragwalk, data, copylen)) {
+		u32 leftover;
+
+		leftover = skb_add_frags(newskb, fragwalk, data, copylen);
+		BUG_ON(leftover != 0);
+	} else {
+		/* We verified this was true in the main receive routine */
+		BUG_ON(skb_tailroom(newskb) < copylen);
+
+		/* copy fragment data into newskb */
+		if (skb_copy_bits_seq(st, data, skb_put(newskb, copylen), copylen)) {
+			XFRM_INC_STATS(dev_net(skb->dev),
+				       LINUX_MIB_XFRMINBUFFERERROR);
+			goto abandon;
+		}
+	}
+
+	if (copylen < ipremain) {
+		xtfs->ra_wantseq++;
+	} else {
+		/* We are done with packet reassembly! */
+		BUG_ON(copylen != ipremain);
+		iptfs_reassem_done(xtfs);
+		iptfs_complete_inner_skb(xtfs->x, newskb);
+		list_add_tail(&newskb->list, list);
+	}
+
+	/* will continue on to new data block or end */
+	return data + fraglen;
+
+abandon:
+	if (xtfs->ra_newskb) {
+		iptfs_reassem_abort(xtfs);
+	} else {
+		xtfs->ra_runtlen = 0;
+		xtfs->ra_wantseq = 0;
+	}
+	/* skip past fragment, maybe to end */
+	return data + min(blkoff, remaining);
+}
+
+/**
+ * iptfs_input_ordered() - handle next in order IPTFS payload.
+ *
+ * Process the IPTFS payload in `skb` and consume it afterwards.
+ */
+static int iptfs_input_ordered(struct xfrm_state *x, struct sk_buff *skb)
+{
+	u8 hbytes[sizeof(struct ipv6hdr)];
+	struct ip_iptfs_cc_hdr iptcch;
+	struct skb_seq_state skbseq;
+	struct skb_frag_walk _fragwalk;
+	struct skb_frag_walk *fragwalk = NULL;
+	struct list_head sublist; /* rename this it's just a list */
+	struct sk_buff *first_skb, *defer, *next;
+	const unsigned char *old_mac;
+	struct xfrm_iptfs_data *xtfs;
+	struct ip_iptfs_hdr *ipth;
+	struct iphdr *iph;
+	struct net *net;
+	u32 remaining, first_iplen, iplen, iphlen, data, tail;
+	u32 blkoff, capturelen;
+	u64 seq;
+
+	xtfs = x->mode_data;
+	net = dev_net(skb->dev);
+	first_skb = NULL;
+	defer = NULL;
+
+	seq = __esp_seq(skb);
+
+	/* Large enough to hold both types of header */
+	ipth = (struct ip_iptfs_hdr *)&iptcch;
+
+	/* Save the old mac header if set */
+	old_mac = skb_mac_header_was_set(skb) ? skb_mac_header(skb) : NULL;
+
+	skb_prepare_seq_read(skb, 0, skb->len, &skbseq);
+
+	/* Get the IPTFS header and validate it */
+
+	if (skb_copy_bits_seq(&skbseq, 0, ipth, sizeof(*ipth))) {
+		XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
+		goto done;
+	}
+	data = sizeof(*ipth);
+
+	trace_iptfs_egress_recv(skb, xtfs, htons(ipth->block_offset));
+
+	/* Set data past the basic header */
+	if (ipth->subtype == IPTFS_SUBTYPE_CC) {
+		/* Copy the rest of the CC header */
+		remaining = sizeof(iptcch) - sizeof(*ipth);
+		if (skb_copy_bits_seq(&skbseq, data, ipth + 1, remaining)) {
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
+			goto done;
+		}
+		data += remaining;
+	} else if (ipth->subtype != IPTFS_SUBTYPE_BASIC) {
+		XFRM_INC_STATS(net, LINUX_MIB_XFRMINHDRERROR);
+		goto done;
+	}
+
+	if (ipth->flags != 0) {
+		XFRM_INC_STATS(net, LINUX_MIB_XFRMINHDRERROR);
+		goto done;
+	}
+
+	INIT_LIST_HEAD(&sublist);
+
+	/* Handle fragment at start of payload, and/or waiting reassembly. */
+
+	blkoff = ntohs(ipth->block_offset);
+	/* check before locking i.e., maybe */
+	if (blkoff || xtfs->ra_runtlen || xtfs->ra_newskb) {
+		spin_lock(&xtfs->drop_lock);
+
+		/* check again after lock */
+		if (blkoff || xtfs->ra_runtlen || xtfs->ra_newskb) {
+			data = iptfs_reassem_cont(xtfs, seq, &skbseq, skb, data,
+						  blkoff, &sublist);
+		}
+
+		spin_unlock(&xtfs->drop_lock);
+	}
+
+	/* New packets */
+
+	tail = skb->len;
+	BUG_ON(xtfs->ra_newskb && data < tail);
+
+	while (data < tail) {
+		u32 protocol = 0;
+
+		/* Gather information on the next data block.
+		 * `data` points to the start of the data block.
+		 */
+		remaining = tail - data;
+
+		/* try and copy enough bytes to read length from ipv4/ipv6 */
+		iphlen = min_t(u32, remaining, 6);
+		if (skb_copy_bits_seq(&skbseq, data, hbytes, iphlen)) {
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
+			goto done;
+		}
+
+		iph = (struct iphdr *)hbytes;
+		if (iph->version == 0x4) {
+			/* must have at least tot_len field present */
+			if (remaining < 4) {
+				/* save the bytes we have, advance data and exit */
+				iptfs_input_save_runt(xtfs, seq, hbytes,
+						      remaining);
+				data += remaining;
+				break;
+			}
+
+			iplen = htons(iph->tot_len);
+			iphlen = iph->ihl << 2;
+			protocol = htons(ETH_P_IP);
+			XFRM_MODE_SKB_CB(skbseq.root_skb)->tos = iph->tos;
+		} else if (iph->version == 0x6) {
+			/* must have at least payload_len field present */
+			if (remaining < 6) {
+				/* save the bytes we have, advance data and exit */
+				iptfs_input_save_runt(xtfs, seq, hbytes,
+						      remaining);
+				data += remaining;
+				break;
+			}
+
+			iplen = htons(((struct ipv6hdr *)hbytes)->payload_len);
+			iplen += sizeof(struct ipv6hdr);
+			iphlen = sizeof(struct ipv6hdr);
+			protocol = htons(ETH_P_IPV6);
+			XFRM_MODE_SKB_CB(skbseq.root_skb)->tos =
+				ipv6_get_dsfield((struct ipv6hdr *)iph);
+		} else if (iph->version == 0x0) {
+			/* pad */
+			data = tail;
+			break;
+		} else {
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMINBUFFERERROR);
+			goto done;
+		}
+
+		if (unlikely(skbseq.stepped_offset)) {
+			/* We need to reset our seq read, it can't backup at
+			 * this point.
+			 */
+			struct sk_buff *save = skbseq.root_skb;
+
+			skb_abort_seq_read(&skbseq);
+			skb_prepare_seq_read(save, data, tail, &skbseq);
+		}
+
+		if (first_skb) {
+			skb = NULL;
+		} else {
+			first_skb = skb;
+			first_iplen = iplen;
+			fragwalk = NULL;
+
+			/* We are going to skip over `data` bytes to reach the
+			 * start of the IP header of `iphlen` len for `iplen`
+			 * inner packet.
+			 */
+
+			if (skb_has_frag_list(skb)) {
+				defer = skb;
+				skb = NULL;
+#if 0
+			} else if (data < skb_headlen(skb)) {
+				/* NOTE: Instead of pskb_pull we could just drop
+				 * the head bytes and as many pages as necessary
+				 * to get us pointing at the correct thing.
+				 */
+				pskb_pull(skb, data);
+				/* Reuse fist skb. Need to move past the initial
+				 * iptfs header as well as any initial fragment
+				 * for previous inner packet reassembly.
+				 */
+				skb_pull(skb, data);
+				data = 0;
+				tail = skb->len;
+				remaining = skb->len;
+
+				skb->protocol = protocol;
+				skb_mac_header_rebuild(skb);
+				if (skb->mac_len)
+					eth_hdr(skb)->h_proto = skb->protocol;
+
+				/* all pointers could be changed now -- reset walk */
+				skb_abort_seq_read(&skbseq);
+				skb_prepare_seq_read(skb, data, tail, &skbseq);
+
+			} else if (skb->data_len == 0 &&
+				   skb_tailroom(skb) + (skb->len - data) >= iplen) {
+#endif
+			} else if (data + iphlen <= skb_headlen(skb) &&
+				   /* make sure our header is 32-bit aligned? */
+				   /* ((uintptr_t)(skb->data + data) & 0x3) == 0 && */
+				   skb_tailroom(skb) + tail - data >= iplen) {
+				/* Reuse the received skb.
+				 *
+				 * We have enough headlen to pull past any
+				 * initial fragment data, leaving at least the
+				 * IP header in the linear buffer space.
+				 *
+				 * For linear buffer space we only require that
+				 * linear buffer space is large enough to
+				 * eventually hold the entire reassembled
+				 * packet (by including tailroom in the check).
+				 *
+				 * For non-linear tailroom is 0 and so we only
+				 * re-use if the entire packet is present
+				 * already.
+				 *
+				 * NOTE: there are many more options for
+				 * sharing, KISS for now. Also, this can produce
+				 * skb's with the IP header unaligned to 32
+				 * bits. If that ends up being a problem then a
+				 * check should be added to the conditional
+				 * above that the header lies on a 32-bit
+				 * boundary as well.
+				 */
+				skb_pull(skb, data);
+
+				/* our range just changed */
+				data = 0;
+				tail = skb->len;
+				remaining = skb->len;
+
+				skb->protocol = protocol;
+				skb_mac_header_rebuild(skb);
+				if (skb->mac_len)
+					eth_hdr(skb)->h_proto = skb->protocol;
+
+				/* all pointers could be changed now reset walk */
+				skb_abort_seq_read(&skbseq);
+				skb_prepare_seq_read(skb, data, tail, &skbseq);
+			} else if (skb->head_frag &&
+				   /* We have the IP header right now */
+				   remaining >= iphlen) {
+				fragwalk = &_fragwalk;
+				skb_prepare_frag_walk(skb, data, fragwalk);
+				defer = skb;
+				skb = NULL;
+			} else {
+				/* We couldn't reuse the input skb so allocate a
+				 * new one.
+				 */
+				defer = skb;
+				skb = NULL;
+			}
+
+			/* Don't trim `first_skb` until the end as we are
+			 * walking that data now.
+			 */
+		}
+
+		capturelen = min(iplen, remaining);
+		if (!skb) {
+			if (!fragwalk ||
+			    /* Large enough to be worth sharing */
+			    iplen < IPTFS_PKT_SHARE_MIN ||
+			    /* Have IP header + some data to share. */
+			    capturelen <= iphlen ||
+			    /* Try creating skb and adding frags */
+			    !(skb = iptfs_pskb_add_frags(first_skb, fragwalk,
+							 data, capturelen,
+							 &skbseq, iphlen))) {
+				skb = iptfs_pskb_extract_seq(iplen, &skbseq,
+							     data, capturelen);
+			}
+			if (!skb) {
+				/* skip to next packet or done */
+				data += capturelen;
+				continue;
+			}
+			BUG_ON(skb->len != capturelen);
+
+			skb->protocol = protocol;
+			if (old_mac) {
+				/* rebuild the mac header */
+				skb_set_mac_header(skb, -first_skb->mac_len);
+				memcpy(skb_mac_header(skb), old_mac,
+				       first_skb->mac_len);
+				eth_hdr(skb)->h_proto = skb->protocol;
+			}
+		}
+
+		data += capturelen;
+
+		if (skb->len < iplen) {
+			BUG_ON(data != tail);
+			BUG_ON(xtfs->ra_newskb);
+
+			/* Start reassembly */
+			spin_lock(&xtfs->drop_lock);
+
+			xtfs->ra_newskb = skb;
+			xtfs->ra_wantseq = seq + 1;
+			if (!hrtimer_is_queued(&xtfs->drop_timer)) {
+				/* softirq blocked lest the timer fire and interrupt us */
+				BUG_ON(!in_interrupt());
+				hrtimer_start(&xtfs->drop_timer,
+					      xtfs->drop_time_ns,
+					      IPTFS_HRTIMER_MODE);
+			}
+
+			spin_unlock(&xtfs->drop_lock);
+
+			break;
+		}
+
+		iptfs_complete_inner_skb(x, skb);
+		list_add_tail(&skb->list, &sublist);
+	}
+
+	if (data != tail)
+		/* this should not happen from the above code */
+		XFRM_INC_STATS(dev_net(skb->dev), LINUX_MIB_XFRMINIPTFSERROR);
+
+	if (first_skb && first_iplen && !defer && first_skb != xtfs->ra_newskb) {
+		/* first_skb is queued b/c !defer and not partial */
+		if (pskb_trim(first_skb, first_iplen)) {
+			/* error trimming */
+			list_del(&first_skb->list);
+			defer = first_skb;
+		}
+		first_skb->ip_summed = CHECKSUM_NONE;
+	}
+
+	/* Send the packets! */
+	list_for_each_entry_safe(skb, next, &sublist, list) {
+		BUG_ON(skb == defer);
+		skb_list_del_init(skb);
+		if (xfrm_input(skb, 0, 0, -3))
+			kfree_skb(skb);
+	}
+
+done:
+	skb = skbseq.root_skb;
+	skb_abort_seq_read(&skbseq);
+
+	if (defer) {
+		consume_skb(defer);
+	} else if (!first_skb) {
+		/* skb is the original passed in skb, but we didn't get far
+		 * enough to process it as the first_skb, if we had it would
+		 * either be save in ra_newskb, trimmed and sent on as an skb or
+		 * placed in defer to be freed.
+		 */
+		BUG_ON(!skb);
+		kfree_skb(skb);
+	}
+
+	return 0;
+}
+
+/* ------------------------------- */
+/* Input (Egress) Re-ordering Code */
+/* ------------------------------- */
+
+static void __vec_shift(struct xfrm_iptfs_data *xtfs, u32 shift)
+{
+	u32 savedlen = xtfs->w_savedlen;
+
+	if (shift > savedlen)
+		shift = savedlen;
+	if (shift != savedlen)
+		memcpy(xtfs->w_saved, xtfs->w_saved + shift,
+		       (savedlen - shift) * sizeof(*xtfs->w_saved));
+	memset(xtfs->w_saved + savedlen - shift, 0,
+	       shift * sizeof(*xtfs->w_saved));
+	xtfs->w_savedlen -= shift;
+}
+
+static int __reorder_past(struct xfrm_iptfs_data *xtfs, struct sk_buff *inskb,
+			  struct list_head *freelist, u32 *fcount)
+{
+	list_add_tail(&inskb->list, freelist);
+	(*fcount)++;
+	return 0;
+}
+
+static u32 __reorder_drop(struct xfrm_iptfs_data *xtfs, struct list_head *list)
+
+{
+	struct skb_wseq *s, *se;
+	const u32 savedlen = xtfs->w_savedlen;
+	u64 wantseq = xtfs->w_wantseq;
+	time64_t now = ktime_get_raw_fast_ns();
+	u32 count = 0;
+	u32 scount = 0;
+
+	BUG_ON(!savedlen);
+	if (xtfs->w_saved[0].drop_time > now)
+		goto set_timer;
+
+	wantseq = ++xtfs->w_wantseq;
+
+	/* Keep flushing packets until we reach a drop time greater than now. */
+	s = xtfs->w_saved;
+	se = s + savedlen;
+	do {
+		/* Walking past empty slots until we reach a packet */
+		for (; s < se && !s->skb; s++)
+			if (s->drop_time > now)
+				goto outerdone;
+		/* Sending packets until we hit another empty slot. */
+		for (; s < se && s->skb; scount++, s++)
+			list_add_tail(&s->skb->list, list);
+	} while (s < se);
+outerdone:
+
+	count = s - xtfs->w_saved;
+	if (count) {
+		xtfs->w_wantseq += count;
+
+		/* Shift handled slots plus final empty slot into slot 0. */
+		__vec_shift(xtfs, count);
+	}
+
+	if (xtfs->w_savedlen) {
+set_timer:
+		/* Drifting is OK */
+		hrtimer_start(&xtfs->drop_timer,
+			      xtfs->w_saved[0].drop_time - now,
+			      IPTFS_HRTIMER_MODE);
+	}
+	return scount;
+}
+
+static u32 __reorder_this(struct xfrm_iptfs_data *xtfs, struct sk_buff *inskb,
+			  struct list_head *list)
+
+{
+	struct skb_wseq *s, *se;
+	const u32 savedlen = xtfs->w_savedlen;
+	u64 wantseq = xtfs->w_wantseq;
+	u32 count = 0;
+
+	/* Got what we wanted. */
+	list_add_tail(&inskb->list, list);
+	wantseq = ++xtfs->w_wantseq;
+	if (!savedlen)
+		return 1;
+
+	/* Flush remaining consecutive packets. */
+
+	/* Keep sending until we hit another missed pkt. */
+	for (s = xtfs->w_saved, se = s + savedlen; s < se && s->skb; s++)
+		list_add_tail(&s->skb->list, list);
+	count = s - xtfs->w_saved;
+	if (count)
+		xtfs->w_wantseq += count;
+
+	/* Shift handled slots plus final empty slot into slot 0. */
+	__vec_shift(xtfs, count + 1);
+
+	return count + 1;
+}
+
+/* Set the slot's drop time and all the empty slots below it until reaching a
+ * filled slot which will already be set.
+ */
+static void iptfs_set_window_drop_times(struct xfrm_iptfs_data *xtfs, int index)
+{
+	const u32 savedlen = xtfs->w_savedlen;
+	struct skb_wseq *s = xtfs->w_saved;
+	time64_t drop_time;
+
+	assert_spin_locked(&xtfs->drop_lock);
+
+	if (savedlen > index + 1) {
+		/* we are below another, our drop time and the timer are already set */
+		BUG_ON(xtfs->w_saved[index + 1].drop_time !=
+		       xtfs->w_saved[index].drop_time);
+		return;
+	}
+	/* we are the most future so get a new drop time. */
+	drop_time = ktime_get_raw_fast_ns();
+	drop_time += xtfs->drop_time_ns;
+
+	/* Walk back through the array setting drop times as we go */
+	s[index].drop_time = drop_time;
+	while (index-- > 0 && !s[index].skb)
+		s[index].drop_time = drop_time;
+
+	/* If we walked all the way back, schedule the drop timer if needed */
+	if (index == -1 && !hrtimer_is_queued(&xtfs->drop_timer))
+		hrtimer_start(&xtfs->drop_timer, xtfs->drop_time_ns,
+			      IPTFS_HRTIMER_MODE);
+}
+
+static u32 __reorder_future_fits(struct xfrm_iptfs_data *xtfs,
+				 struct sk_buff *inskb,
+				 struct list_head *freelist, u32 *fcount)
+{
+	const u32 nslots = xtfs->cfg.reorder_win_size + 1;
+	const u64 inseq = __esp_seq(inskb);
+	const u64 wantseq = xtfs->w_wantseq;
+	const u64 distance = inseq - wantseq;
+	const u32 savedlen = xtfs->w_savedlen;
+	const u32 index = distance - 1;
+
+	BUG_ON(distance >= nslots);
+
+	/* Handle future sequence number received which fits in the window.
+	 *
+	 * We know we don't have the seq we want so we won't be able to flush
+	 * anything.
+	 */
+
+	/* slot count is 4, saved size is 3 savedlen is 2
+	 *
+	 * "window boundary" is based on the fixed window size
+	 * distance is also slot number
+	 * index is an array index (i.e., - 1 of slot)
+	 * : : - implicit NULL after array len
+	 *
+	 *          +--------- used length (savedlen == 2)
+	 *          |   +----- array size (nslots - 1 == 3)
+	 *          |   |   + window boundary (nslots == 4)
+	 *          V   V | V
+	 *                |
+	 *  0   1   2   3 |   slot number
+	 * ---  0   1   2 |   array index
+	 *     [-] [b] : :|   array
+	 *
+	 * "2" "3" "4" *5*|   seq numbers
+	 *
+	 * We receive seq number 5
+	 * distance == 3 [inseq(5) - w_wantseq(2)]
+	 * index == 2 [distance(6) - 1]
+	 */
+
+	if (xtfs->w_saved[index].skb) {
+		/* a dup of a future */
+		list_add_tail(&inskb->list, freelist);
+		(*fcount)++;
+		return 0;
+	}
+
+	xtfs->w_saved[index].skb = inskb;
+	xtfs->w_savedlen = max(savedlen, index + 1);
+	iptfs_set_window_drop_times(xtfs, index);
+
+	return 0;
+}
+
+static u32 __reorder_future_shifts(struct xfrm_iptfs_data *xtfs,
+				   struct sk_buff *inskb,
+				   struct list_head *list,
+				   struct list_head *freelist, u32 *fcount)
+{
+	const u32 nslots = xtfs->cfg.reorder_win_size + 1;
+	const u64 inseq = __esp_seq(inskb);
+	u32 savedlen = xtfs->w_savedlen;
+	u64 wantseq = xtfs->w_wantseq;
+	struct sk_buff *slot0 = NULL;
+	u64 last_drop_seq = xtfs->w_wantseq;
+	u64 distance, extra_drops, missed, s0seq;
+	u32 count = 0;
+	struct skb_wseq *wnext;
+	u32 beyond, shifting, slot;
+
+	BUG_ON(inseq <= wantseq);
+	distance = inseq - wantseq;
+	BUG_ON(distance <= nslots - 1);
+	beyond = distance - (nslots - 1);
+	missed = 0;
+
+	/* Handle future sequence number received.
+	 *
+	 * IMPORTANT: we are at least advancing w_wantseq (i.e., wantseq) by 1
+	 * b/c we are beyond the window boundary.
+	 *
+	 * We know we don't have the wantseq so that counts as a drop.
+	 */
+
+	/* ex: slot count is 4, array size is 3 savedlen is 2, slot 0 is the
+	 * missing sequence number.
+	 *
+	 * the final slot at savedlen (index savedlen - 1) is always occupied.
+	 *
+	 * beyond is "beyond array size" not savedlen.
+	 *
+	 *          +--------- array length (savedlen == 2)
+	 *          |   +----- array size (nslots - 1 == 3)
+	 *          |   |   +- window boundary (nslots == 4)
+	 *          V   V | V
+	 *                |
+	 *  0   1   2   3 |   slot number
+	 * ---  0   1   2 |   array index
+	 *     [b] [c] : :|   array
+	 *                |
+	 * "2" "3" "4" "5"|*6*  seq numbers
+	 *
+	 * We receive seq number 6
+	 * distance == 4 [inseq(6) - w_wantseq(2)]
+	 * newslot == distance
+	 * index == 3 [distance(4) - 1]
+	 * beyond == 1 [newslot(4) - lastslot((nslots(4) - 1))]
+	 * shifting == 1 [min(savedlen(2), beyond(1)]
+	 * slot0_skb == [b], and should match w_wantseq
+	 *
+	 *                +--- window boundary (nslots == 4)
+	 *  0   1   2   3 | 4   slot number
+	 * ---  0   1   2 | 3   array index
+	 *     [b] : : : :|     array
+	 * "2" "3" "4" "5" *6*  seq numbers
+	 *
+	 * We receive seq number 6
+	 * distance == 4 [inseq(6) - w_wantseq(2)]
+	 * newslot == distance
+	 * index == 3 [distance(4) - 1]
+	 * beyond == 1 [newslot(4) - lastslot((nslots(4) - 1))]
+	 * shifting == 1 [min(savedlen(1), beyond(1)]
+	 * slot0_skb == [b] and should match w_wantseq
+	 *
+	 *                +-- window boundary (nslots == 4)
+	 *  0   1   2   3 | 4   5   6   slot number
+	 * ---  0   1   2 | 3   4   5   array index
+	 *     [-] [c] : :|             array
+	 * "2" "3" "4" "5" "6" "7" *8*  seq numbers
+	 *
+	 * savedlen = 2, beyond = 3
+	 * iter 1: slot0 == NULL, missed++, lastdrop = 2 (2+1-1), slot0 = [-]
+	 * iter 2: slot0 == NULL, missed++, lastdrop = 3 (2+2-1), slot0 = [c]
+	 * 2 < 3, extra = 1 (3-2), missed += extra, lastdrop = 4 (2+2+1-1)
+	 *
+	 * We receive seq number 8
+	 * distance == 6 [inseq(8) - w_wantseq(2)]
+	 * newslot == distance
+	 * index == 5 [distance(6) - 1]
+	 * beyond == 3 [newslot(6) - lastslot((nslots(4) - 1))]
+	 * shifting == 2 [min(savedlen(2), beyond(3)]
+	 *
+	 * slot0_skb == NULL changed from [b] when "savedlen < beyond" is true.
+	 */
+
+	/* Now send any packets that are being shifted out of saved, and account
+	 * for missing packets that are exiting the window as we shift it.
+	 */
+
+	/* If savedlen > beyond we are shifting some, else all. */
+	shifting = min(savedlen, beyond);
+
+	/* slot0 is the buf that just shifted out and into slot0 */
+	slot0 = NULL;
+	s0seq = wantseq;
+	last_drop_seq = s0seq;
+	wnext = xtfs->w_saved;
+	for (slot = 1; slot <= shifting; slot++, wnext++) {
+		/* handle what was in slot0 before we occupy it */
+		if (!slot0) {
+			last_drop_seq = s0seq;
+			missed++;
+		} else {
+			list_add_tail(&slot0->list, list);
+			count++;
+		}
+		s0seq++;
+		slot0 = wnext->skb;
+		wnext->skb = NULL;
+	}
+
+	/* slot0 is now either NULL (in which case it's what we now are waiting
+	 * for, or a buf in which case we need to handle it like we received it;
+	 * however, we may be advancing past that buffer as well..
+	 */
+
+	/* Handle case where we need to shift more than we had saved, slot0 will
+	 * be NULL iff savedlen is 0, otherwise slot0 will always be
+	 * non-NULL b/c we shifted the final element, which is always set if
+	 * there is any saved, into slot0.
+	 */
+	if (savedlen < beyond) {
+		extra_drops = beyond - savedlen;
+		if (savedlen == 0) {
+			BUG_ON(slot0);
+			s0seq += extra_drops;
+			last_drop_seq = s0seq - 1;
+		} else {
+			extra_drops--; /* we aren't dropping what's in slot0 */
+			BUG_ON(!slot0);
+			list_add_tail(&slot0->list, list);
+			/* if extra_drops then we are going past this slot0
+			 * so we can safely advance last_drop_seq
+			 */
+			if (extra_drops)
+				last_drop_seq = s0seq + extra_drops;
+			s0seq += extra_drops + 1;
+			count++;
+		}
+		missed += extra_drops;
+		slot0 = NULL;
+		/* slot0 has had an empty slot pushed into it */
+	}
+
+	/* Remove the entries */
+	__vec_shift(xtfs, beyond);
+
+	/* Advance want seq */
+	xtfs->w_wantseq += beyond;
+
+	/* Process drops here when implementing congestion control */
+
+	/* We've shifted. plug the packet in at the end. */
+	xtfs->w_savedlen = nslots - 1;
+	xtfs->w_saved[xtfs->w_savedlen - 1].skb = inskb;
+	iptfs_set_window_drop_times(xtfs, xtfs->w_savedlen - 1);
+
+	/* if we don't have a slot0 then we must wait for it */
+	if (!slot0)
+		return count;
+
+	/* If slot0, seq must match new want seq */
+	BUG_ON(xtfs->w_wantseq != __esp_seq(slot0));
+
+	/* slot0 is valid, treat like we received expected. */
+	count += __reorder_this(xtfs, slot0, list);
+	return count;
+}
+
+/* Receive a new packet into the reorder window. Return a list of ordered
+ * packets from the window.
+ */
+static u32 iptfs_input_reorder(struct xfrm_iptfs_data *xtfs,
+			       struct sk_buff *inskb, struct list_head *list,
+			       struct list_head *freelist, u32 *fcount)
+{
+	const u32 nslots = xtfs->cfg.reorder_win_size + 1;
+	u64 inseq = __esp_seq(inskb);
+	u64 wantseq;
+
+	assert_spin_locked(&xtfs->drop_lock);
+
+	if (unlikely(!xtfs->w_seq_set)) {
+		xtfs->w_seq_set = true;
+		xtfs->w_wantseq = inseq;
+	}
+	wantseq = xtfs->w_wantseq;
+
+	if (likely(inseq == wantseq))
+		return __reorder_this(xtfs, inskb, list);
+	else if (inseq < wantseq)
+		return __reorder_past(xtfs, inskb, freelist, fcount);
+	else if ((inseq - wantseq) < nslots)
+		return __reorder_future_fits(xtfs, inskb, freelist, fcount);
+	else
+		return __reorder_future_shifts(xtfs, inskb, list, freelist,
+					       fcount);
+}
+
+/**
+ * iptfs_drop_timer() - Handle drop timer expiry.
+ *
+ * This is similar to our input function.
+ *
+ * The drop timer is set when we start an in progress reassembly, and also when
+ * we save a future packet in the window saved array.
+ *
+ * NOTE packets in the save window are always newer WRT drop times as
+ * they get further in the future. i.e. for:
+ *
+ *    if slots (S0, S1, ... Sn) and `Dn` is the drop time for slot `Sn`,
+ *    then D(n-1) <= D(n).
+ *
+ * So, regardless of why the timer is firing we can always discard any inprogress
+ * fragment; either it's the reassembly timer, or slot 0 is going to be
+ * dropped as S0 must have the most recent drop time, and slot 0 holds the
+ * continuation fragment of the in progress packet.
+ */
+static enum hrtimer_restart iptfs_drop_timer(struct hrtimer *me)
+{
+	struct sk_buff *skb, *next;
+	struct list_head freelist, list;
+	struct xfrm_iptfs_data *xtfs;
+	struct xfrm_state *x;
+	u32 count, fcount;
+
+	xtfs = container_of(me, typeof(*xtfs), drop_timer);
+	x = xtfs->x;
+
+	spin_lock(&xtfs->drop_lock);
+
+	INIT_LIST_HEAD(&list);
+	INIT_LIST_HEAD(&freelist);
+	fcount = 0;
+
+	/* Drop any in progress packet */
+
+	if (xtfs->ra_newskb) {
+		kfree_skb(xtfs->ra_newskb);
+		xtfs->ra_newskb = NULL;
+	}
+
+	/* Now drop as many packets as we should from the reordering window
+	 * saved array
+	 */
+	count = xtfs->w_savedlen ? __reorder_drop(xtfs, &list) : 0;
+
+	spin_unlock(&xtfs->drop_lock);
+
+	if (count) {
+		list_for_each_entry_safe(skb, next, &list, list) {
+			skb_list_del_init(skb);
+			(void)iptfs_input_ordered(x, skb);
+		}
+	}
+	return HRTIMER_NORESTART;
+}
+
+/**
+ * iptfs_input() - handle receipt of iptfs payload
+ * @x: xfrm state.
+ * @skb: the packet.
+ *
+ * We have an IPTFS payload order it if needed, then process newly in order
+ * packetsA.
+ */
+static int iptfs_input(struct xfrm_state *x, struct sk_buff *skb)
+{
+	struct list_head freelist, list;
+	struct xfrm_iptfs_data *xtfs = x->mode_data;
+	struct sk_buff *next;
+	u32 count, fcount;
+
+	/* Fast path for no reorder window. */
+	if (xtfs->cfg.reorder_win_size == 0) {
+		iptfs_input_ordered(x, skb);
+		goto done;
+	}
+
+	/* Fetch list of in-order packets from the reordering window as well as
+	 * a list of buffers we need to now free.
+	 */
+	INIT_LIST_HEAD(&list);
+	INIT_LIST_HEAD(&freelist);
+	fcount = 0;
+
+	spin_lock(&xtfs->drop_lock);
+	count = iptfs_input_reorder(xtfs, skb, &list, &freelist, &fcount);
+	spin_unlock(&xtfs->drop_lock);
+
+	if (count) {
+		list_for_each_entry_safe(skb, next, &list, list) {
+			skb_list_del_init(skb);
+			(void)iptfs_input_ordered(x, skb);
+		}
+	}
+
+	if (fcount) {
+		list_for_each_entry_safe(skb, next, &freelist, list) {
+			skb_list_del_init(skb);
+			kfree_skb(skb);
+		}
+	}
+done:
+	/* We always have dealt with the input SKB, either we are re-using it,
+	 * or we have freed it. Return EINPROGRESS so that xfrm_input stops
+	 * processing it.
+	 */
+	return -EINPROGRESS;
+}
+
+/* ================================= */
+/* IPTFS Sending (ingress) Functions */
+/* ================================= */
+
+/* ------------------------- */
+/* Enqueue to send functions */
+/* ------------------------- */
+
+/**
+ * iptfs_enqueue() - enqueue packet if ok to send.
+ * @xtfs: xtfs state
+ * @skb: the packet
+ *
+ * Return: true if packet enqueued.
+ */
+static bool iptfs_enqueue(struct xfrm_iptfs_data *xtfs, struct sk_buff *skb)
+{
+	u64 newsz = xtfs->queue_size + skb->len;
+	struct iphdr *iph;
+
+	assert_spin_locked(&xtfs->x->lock);
+
+	if (newsz > xtfs->cfg.max_queue_size)
+		return false;
+
+	/* Set ECN CE if we are above our ECN queue threshold */
+	if (newsz > xtfs->ecn_queue_size) {
+		iph = ip_hdr(skb);
+		if (iph->version == 4)
+			IP_ECN_set_ce(iph);
+		else if (iph->version == 6)
+			IP6_ECN_set_ce(skb, ipv6_hdr(skb));
+	}
+
+	__skb_queue_tail(&xtfs->queue, skb);
+	xtfs->queue_size += skb->len;
+	return true;
+}
+
+static int iptfs_get_cur_pmtu(struct xfrm_state *x, struct xfrm_iptfs_data *xtfs,
+			      struct sk_buff *skb)
+{
+	struct xfrm_dst *xdst = (struct xfrm_dst *)skb_dst(skb);
+	u32 payload_mtu = xtfs->payload_mtu;
+	u32 pmtu = __iptfs_get_inner_mtu(x, xdst->child_mtu_cached);
+
+	if (payload_mtu && payload_mtu < pmtu)
+		pmtu = payload_mtu;
+
+	return pmtu;
+}
+
+static int iptfs_is_too_big(struct sock *sk, struct sk_buff *skb, u32 pmtu)
+{
+	struct flowi6 fl6;
+
+	if (skb->len <= pmtu)
+		return 0;
+
+	/* We only send ICMP too big if the user has configured us as
+	 * dont-fragment. We need to adjust something in the
+	 * stack as we are never getting here (good) even when
+	 * our no DF config is set (bad).
+	 */
+	XFRM_INC_STATS(dev_net(skb->dev), LINUX_MIB_XFRMOUTERROR);
+
+	if (sk) {
+		if (ip_hdr(skb)->version == 4) {
+			xfrm_local_error(skb, pmtu);
+		} else {
+			WARN_ON_ONCE(ip_hdr(skb)->version != 6);
+
+			memset(&fl6, 0, sizeof(fl6));
+			ipv6_local_error(skb->sk, EMSGSIZE, &fl6, pmtu);
+		}
+	} else if (ip_hdr(skb)->version == 4) {
+		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
+			  htonl(pmtu));
+	} else {
+		WARN_ON_ONCE(ip_hdr(skb)->version != 6);
+		icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, pmtu);
+	}
+	return 1;
+}
+
+/* IPv4/IPv6 packet ingress to IPTFS tunnel, arrange to send in IPTFS payload
+ * (i.e., aggregating or fragmenting as appropriate).
+ * This is set in dst->output for an SA.
+ */
+static int iptfs_output_collect(struct net *net, struct sock *sk,
+				struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct xfrm_state *x = dst->xfrm;
+	struct xfrm_iptfs_data *xtfs = x->mode_data;
+	struct sk_buff *segs, *nskb;
+	u32 count, qcount;
+	u32 pmtu = 0;
+	bool ok = true;
+	bool was_gso;
+
+	/* We have hooked into dst_entry->output which means we have skipped the
+	 * protocol specific netfilter (see xfrm4_output, xfrm6_output).
+	 * when our timer runs we will end up calling xfrm_output directly on
+	 * the encapsulated traffic.
+	 *
+	 * For both cases this is the NF_INET_POST_ROUTING hook which allows
+	 * changing the skb->dst entry which then may not be xfrm based anymore
+	 * in which case a REROUTED flag is set. and dst_output is called.
+	 *
+	 * For IPv6 we are also skipping fragmentation handling for local
+	 * sockets, which may or may not be good depending on our tunnel DF
+	 * setting. Normally with fragmentation supported we want to skip this
+	 * fragmentation.
+	 */
+
+	BUG_ON(!xtfs);
+
+	if (xtfs->cfg.dont_frag)
+		pmtu = iptfs_get_cur_pmtu(x, xtfs, skb);
+
+	/* Break apart GSO skbs. If the queue is nearing full then we want the
+	 * accounting and queuing to be based on the individual packets not on the
+	 * aggregate GSO buffer.
+	 */
+	was_gso = skb_is_gso(skb);
+	if (!was_gso) {
+		segs = skb;
+	} else {
+		segs = skb_gso_segment(skb, 0);
+		if (IS_ERR_OR_NULL(segs)) {
+			XFRM_INC_STATS(net, LINUX_MIB_XFRMOUTERROR);
+			kfree_skb(skb);
+			return PTR_ERR(segs);
+		}
+		consume_skb(skb);
+		skb = NULL;
+	}
+
+	count = 0;
+	qcount = 0;
+
+	/* We can be running on multiple cores and from the network softirq or
+	 * from user context depending on where the packet is coming from.
+	 */
+	spin_lock_bh(&x->lock);
+
+	skb_list_walk_safe(segs, skb, nskb) {
+		skb_mark_not_on_list(skb);
+		count++;
+
+		/* Once we drop due to no queue space we continue to drop the
+		 * rest of the packets from that GRO.
+		 */
+		if (!ok) {
+nospace:
+			trace_iptfs_no_queue_space(skb, xtfs, pmtu, was_gso);
+			XFRM_INC_STATS(dev_net(skb->dev), LINUX_MIB_XFRMOUTNOQSPACE);
+			kfree_skb_reason(skb, SKB_DROP_REASON_FULL_RING);
+			continue;
+		}
+
+		/* If the user indicated no iptfs fragmenting check before
+		 * enqueue.
+		 */
+		if (xtfs->cfg.dont_frag && iptfs_is_too_big(sk, skb, pmtu)) {
+			trace_iptfs_too_big(skb, xtfs, pmtu, was_gso);
+			kfree_skb_reason(skb, SKB_DROP_REASON_PKT_TOO_BIG);
+			continue;
+		}
+
+		/* Enqueue to send in tunnel */
+
+		ok = iptfs_enqueue(xtfs, skb);
+		if (!ok)
+			goto nospace;
+
+		trace_iptfs_enqueue(skb, xtfs, pmtu, was_gso);
+		qcount++;
+	}
+
+	/* Start a delay timer if we don't have one yet */
+	if (!hrtimer_is_queued(&xtfs->iptfs_timer)) {
+		/* softirq blocked lest the timer fire and interrupt us */
+		BUG_ON(!in_interrupt());
+		hrtimer_start(&xtfs->iptfs_timer, xtfs->init_delay_ns,
+			      IPTFS_HRTIMER_MODE);
+
+		xtfs->iptfs_settime = ktime_get_raw_fast_ns();
+		trace_iptfs_timer_start(xtfs, xtfs->init_delay_ns);
+	}
+
+	spin_unlock_bh(&x->lock);
+	return 0;
+}
+
+/* -------------------------- */
+/* Dequeue and send functions */
+/* -------------------------- */
+
+static void iptfs_output_prepare_skb(struct sk_buff *skb, u32 blkoff)
+{
+	struct ip_iptfs_hdr *h;
+	size_t hsz = sizeof(*h);
+
+	/* now reset values to be pointing at the rest of the packets */
+	h = skb_push(skb, hsz);
+	memset(h, 0, hsz);
+	if (blkoff)
+		h->block_offset = htons(blkoff);
+
+	/* network_header current points at the inner IP packet
+	 * move it to the iptfs header
+	 */
+	skb->transport_header = skb->network_header;
+	skb->network_header -= hsz;
+
+	IPCB(skb)->flags |= IPSKB_XFRM_TUNNEL_SIZE;
+
+	/* xfrm[46]_prepare_output sets skb->protocol here, but the resulting
+	 * called ip[6]_output functions also set this value as appropriate so
+	 * seems unnecessary
+	 *
+	 * skb->protocol = htons(ETH_P_IP) or htons(ETH_P_IPV6);
+	 */
+}
+
+/**
+ * iptfs_copy_create_frag() - create an inner fragment skb.
+ * @st: The source packet data.
+ * @offset: offset in @st of the new fragment data.
+ * @copy_len: the amount of data to copy from @st.
+ *
+ * Create a new skb holding a single IPTFS inner packet fragment. @copy_len must
+ * not be greater than the max fragment size.
+ *
+ * Return: the new fragment skb or an ERR_PTR().
+ */
+static struct sk_buff *iptfs_copy_create_frag(struct skb_seq_state *st,
+					      u32 offset, u32 copy_len)
+{
+	struct sk_buff *src = st->root_skb;
+	struct sk_buff *skb;
+	int err;
+
+	skb = iptfs_alloc_skb(src, copy_len, true);
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	/* Now copy `copy_len` data from src */
+	err = skb_copy_bits_seq(st, offset, skb_put(skb, copy_len), copy_len);
+	if (err) {
+		XFRM_INC_STATS(dev_net(src->dev), LINUX_MIB_XFRMOUTERROR);
+		kfree_skb(skb);
+		return ERR_PTR(err);
+	}
+
+	return skb;
+}
+
+/**
+ * iptfs_copy_create_frags() - create and send N-1 fragments of a larger skb.
+ * @skbp: the source packet skb (IN), skb holding the last fragment in
+ *        the fragment stream (OUT).
+ * @xtfs: IPTFS SA state.
+ * @mtu: the max IPTFS fragment size.
+ *
+ * This function is responsible for fragmenting a larger inner packet into a
+ * sequence of IPTFS payload packets. The last fragment is returned rather than
+ * being sent so that the caller can append more inner packets (aggregation) if
+ * there is room.
+ */
+static int iptfs_copy_create_frags(struct sk_buff **skbp,
+				   struct xfrm_iptfs_data *xtfs, u32 mtu)
+{
+	struct skb_seq_state skbseq;
+	struct list_head sublist;
+	struct sk_buff *skb = *skbp;
+	struct sk_buff *nskb = *skbp;
+	u32 copy_len, offset;
+	u32 to_copy = skb->len - mtu;
+	u32 blkoff = 0;
+	int err = 0;
+
+	INIT_LIST_HEAD(&sublist);
+
+	BUG_ON(skb->len <= mtu);
+	skb_prepare_seq_read(skb, 0, skb->len, &skbseq);
+
+	/* A trimmed `skb` will be sent as the first fragment, later. */
+	offset = mtu;
+	to_copy = skb->len - offset;
+	while (to_copy) {
+		/* Send all but last fragment to allow agg. append */
+		trace_iptfs_first_fragmenting(nskb, mtu, to_copy, NULL);
+		list_add_tail(&nskb->list, &sublist);
+
+		/* FUTURE: if the packet has an odd/non-aligning length we could
+		 * send less data in the penultimate fragment so that the last
+		 * fragment then ends on an aligned boundary.
+		 */
+		copy_len = to_copy <= mtu ? to_copy : mtu;
+		nskb = iptfs_copy_create_frag(&skbseq, offset, copy_len);
+		if (IS_ERR(nskb)) {
+			XFRM_INC_STATS(dev_net(skb->dev),
+				       LINUX_MIB_XFRMOUTERROR);
+			skb_abort_seq_read(&skbseq);
+			err = PTR_ERR(nskb);
+			nskb = NULL;
+			break;
+		}
+		iptfs_output_prepare_skb(nskb, to_copy);
+		offset += copy_len;
+		to_copy -= copy_len;
+		blkoff = to_copy;
+	}
+	skb_abort_seq_read(&skbseq);
+
+	/* return last fragment that will be unsent (or NULL) */
+	*skbp = nskb;
+	if (nskb)
+		trace_iptfs_first_final_fragment(nskb, mtu, blkoff, NULL);
+
+	/* trim the original skb to MTU */
+	if (!err)
+		err = pskb_trim(skb, mtu);
+
+	if (err) {
+		/* Free all frags. Don't bother sending a partial packet we will
+		 * never complete.
+		 */
+		kfree_skb(nskb);
+		list_for_each_entry_safe(skb, nskb, &sublist, list) {
+			skb_list_del_init(skb);
+			kfree_skb(skb);
+		}
+		return err;
+	}
+
+	/* prepare the initial fragment with an iptfs header */
+	iptfs_output_prepare_skb(skb, 0);
+
+	/* Send all but last fragment. */
+	list_for_each_entry_safe(skb, nskb, &sublist, list) {
+		skb_list_del_init(skb);
+		xfrm_output(NULL, skb);
+	}
+
+	return 0;
+}
+
+/**
+ * iptfs_first_should_copy() - determine if we should copy packet data.
+ * @first_skb: the first skb in the packet
+ * @mtu: the MTU.
+ *
+ * Determine if we should create subsequent skbs to hold the remaining data from
+ * a large inner packet by copying the packet data, or cloning the original skb
+ * and adjusting the offsets.
+ */
+static bool iptfs_first_should_copy(struct sk_buff *first_skb, u32 mtu)
+{
+	u32 frag_copy_max;
+
+	/* If we have less than frag_copy_max for remaining packet we copy
+	 * those tail bytes as it is more efficient.
+	 */
+	frag_copy_max = mtu <= IPTFS_FRAG_COPY_MAX ? mtu : IPTFS_FRAG_COPY_MAX;
+	if ((int)first_skb->len - (int)mtu < (int)frag_copy_max)
+		return true;
+
+	/* If we have non-linear skb just use copy */
+	if (skb_is_nonlinear(first_skb))
+		return true;
+
+	/* So we have a simple linear skb, easy to clone and share */
+	return false;
+}
+
+/**
+ * iptfs_first_skb() - handle the first dequeued inner packet for output
+ * @skbp: the source packet skb (IN), skb holding the last fragment in
+ *        the fragment stream (OUT).
+ * @xtfs: IPTFS SA state.
+ * @mtu: the max IPTFS fragment size.
+ *
+ * This function is responsible for fragmenting a larger inner packet into a
+ * sequence of IPTFS payload packets. If it needs to fragment into subsequent
+ * skb's, it will either do so by copying or cloning.
+ *
+ * The last fragment is returned rather than being sent so that the caller can
+ * append more inner packets (aggregation) if there is room.
+ *
+ */
+static int iptfs_first_skb(struct sk_buff **skbp, struct xfrm_iptfs_data *xtfs,
+			   u32 mtu)
+{
+	struct sk_buff *skb = *skbp;
+	int err;
+
+	/* Classic ESP skips the don't fragment ICMP error if DF is clear on
+	 * the inner packet or ignore_df is set. Otherwise it will send an ICMP
+	 * or local error if the inner packet won't fit it's MTU.
+	 *
+	 * With IPTFS we do not care about the inner packet DF bit. If the
+	 * tunnel is configured to "don't fragment" we error back if things
+	 * don't fit in our max packet size. Otherwise we iptfs-fragment as
+	 * normal.
+	 */
+
+	/* The opportunity for HW offload has ended */
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		err = skb_checksum_help(skb);
+		if (err)
+			return err;
+	}
+
+	/* We've split these up before queuing */
+	BUG_ON(skb_is_gso(skb));
+
+	trace_iptfs_first_dequeue(skb, mtu, 0, ip_hdr(skb));
+
+	/* Simple case -- it fits. `mtu` accounted for all the overhead
+	 * including the basic IPTFS header.
+	 */
+	if (skb->len <= mtu) {
+		iptfs_output_prepare_skb(skb, 0);
+		return 0;
+	}
+
+	BUG_ON(xtfs->cfg.dont_frag);
+
+	if (iptfs_first_should_copy(skb, mtu))
+		return iptfs_copy_create_frags(skbp, xtfs, mtu);
+
+	/* For now we always copy */
+	return iptfs_copy_create_frags(skbp, xtfs, mtu);
+}
+
+static struct sk_buff **iptfs_rehome_fraglist(struct sk_buff **nextp,
+					      struct sk_buff *child)
+{
+	u32 fllen = 0;
+
+	BUG_ON(!skb_has_frag_list(child));
+
+	/* It might be possible to account for a frag list in addition to page
+	 * fragment if it's a valid state to be in. The page fragments size
+	 * should be kept as data_len so only the frag_list size is removed,
+	 * this must be done above as well took
+	 */
+	BUG_ON(skb_shinfo(child)->nr_frags);
+	*nextp = skb_shinfo(child)->frag_list;
+	while (*nextp) {
+		fllen += (*nextp)->len;
+		nextp = &(*nextp)->next;
+	}
+	skb_frag_list_init(child);
+	BUG_ON(fllen > child->data_len);
+	child->len -= fllen;
+	child->data_len -= fllen;
+
+	return nextp;
+}
+
+static void iptfs_consume_frags(struct sk_buff *to, struct sk_buff *from)
+{
+	struct skb_shared_info *fromi = skb_shinfo(from);
+	struct skb_shared_info *toi = skb_shinfo(to);
+	unsigned int new_truesize;
+
+	/* If we have data in a head page, grab it */
+	if (!skb_headlen(from)) {
+		new_truesize = SKB_TRUESIZE(skb_end_offset(from));
+	} else {
+		skb_head_to_frag(from, &toi->frags[toi->nr_frags]);
+		skb_frag_ref(to, toi->nr_frags++);
+		new_truesize = SKB_DATA_ALIGN(sizeof(struct sk_buff));
+	}
+
+	/* Move any other page fragments rather than copy */
+	memcpy(&toi->frags[toi->nr_frags], fromi->frags,
+	       sizeof(fromi->frags[0]) * fromi->nr_frags);
+	toi->nr_frags += fromi->nr_frags;
+	fromi->nr_frags = 0;
+	from->data_len = 0;
+	from->len = 0;
+	to->truesize += from->truesize - new_truesize;
+	from->truesize = new_truesize;
+
+	/* We are done with this SKB */
+	consume_skb(from);
+}
+
+static void iptfs_output_queued(struct xfrm_state *x, struct sk_buff_head *list)
+{
+	struct xfrm_iptfs_data *xtfs = x->mode_data;
+	u32 payload_mtu = xtfs->payload_mtu;
+	struct sk_buff *skb, *skb2, **nextp;
+	struct skb_shared_info *shi, *shi2;
+
+	/* For now we are just outputting packets as fast as we can, so if we
+	 * are fragmenting we will do so until the last inner packet has been
+	 * consumed.
+	 *
+	 * When we are fragmenting we need to output all outer packets that
+	 * contain the fragments of a single inner packet, consecutively (ESP
+	 * seq-wise). So we need a lock to keep another CPU from sending the
+	 * next batch of packets (it's `list`) and trying to output those, while
+	 * we output our `list` resuling with interleaved non-spec-client inner
+	 * packet streams. Thus we need to lock the IPTFS output on a per SA
+	 * basis while we process this list.
+	 */
+
+	/* NOTE: for the future, for timed packet sends, if our queue is not
+	 * growing longer (i.e., we are keeping up) and a packet we are about to
+	 * fragment will not fragment in then next outer packet, we might consider
+	 * holding on to it to send whole in the next slot. The question then is
+	 * does this introduce a continuous delay in the inner packet stream
+	 * with certain packet rates and sizes?
+	 */
+
+	/* and send them on their way */
+
+	while ((skb = __skb_dequeue(list))) {
+		struct xfrm_dst *xdst = (struct xfrm_dst *)skb_dst(skb);
+		u32 mtu = __iptfs_get_inner_mtu(x, xdst->child_mtu_cached);
+		bool share_ok = true;
+		int remaining;
+
+		/* protocol comes to us cleared sometimes */
+		skb->protocol = x->outer_mode.family == AF_INET ?
+					htons(ETH_P_IP) :
+					htons(ETH_P_IPV6);
+
+		if (payload_mtu && payload_mtu < mtu)
+			mtu = payload_mtu;
+
+		if (skb->len > mtu && xtfs->cfg.dont_frag) {
+			/* We handle this case before enqueueing so we are only
+			 * here b/c MTU changed after we enqueued before we
+			 * dequeued, just drop these.
+			 */
+			XFRM_INC_STATS(dev_net(skb->dev),
+				       LINUX_MIB_XFRMOUTERROR);
+
+			trace_iptfs_first_toobig(skb, mtu, 0, ip_hdr(skb));
+			kfree_skb_reason(skb, SKB_DROP_REASON_PKT_TOO_BIG);
+			continue;
+		}
+
+		/* iptfs_first_skb will free skb on error as well */
+		if (iptfs_first_skb(&skb, xtfs, mtu))
+			continue;
+
+		/* The returned skb is the last IPTFS fragment, it has it's
+		 * IPTFS header included and it's blkoff set just past the end
+		 * fragment data if needed. The space remaining to send more
+		 * inner packet data is `mtu` - (skb->len - sizeof iptfs
+		 * header). This is b/c the `mtu` value has the basic IPTFS
+		 * header len accounted for, and we added that header to the skb
+		 * so it is a part of skb->len, thus we subtract it from the skb
+		 * length.
+		 */
+		remaining = mtu - (skb->len - sizeof(struct ip_iptfs_hdr));
+
+		/* Re-home nested fragment lists. */
+		shi = skb_shinfo(skb);
+		nextp = &shi->frag_list;
+		while (*nextp) {
+			if (skb_has_frag_list(*nextp))
+				nextp = iptfs_rehome_fraglist(&(*nextp)->next,
+							      *nextp);
+			else
+				nextp = &(*nextp)->next;
+		}
+
+		if (shi->frag_list || skb_cloned(skb) || skb_shared(skb))
+			share_ok = false;
+
+		/* See if we have enough space to simply append.
+		 *
+		 * NOTE: Maybe do not append if we will be mis-aligned,
+		 * SW-based endpoints will probably have to copy in this
+		 * case.
+		 */
+		while ((skb2 = skb_peek(list))) {
+			trace_iptfs_ingress_nth_peek(skb2, remaining);
+			if (skb2->len > remaining)
+				break;
+
+			__skb_unlink(skb2, list);
+
+			/* The opportunity for HW offload has ended, if we
+			 * don't have a cksum in the packet we need to add one
+			 * before encap and transmit.
+			 */
+			if (skb2->ip_summed == CHECKSUM_PARTIAL) {
+				if (skb_checksum_help(skb2)) {
+					XFRM_INC_STATS(dev_net(skb_dst(skb2)->dev),
+						       LINUX_MIB_XFRMOUTERROR);
+					kfree_skb(skb2);
+					continue;
+				}
+			}
+
+			/* skb->pp_recycle is passed to __skb_flag_unref for all
+			 * frag pages so we can only share pages with skb's who
+			 * match ourselves.
+			 */
+			shi2 = skb_shinfo(skb2);
+			if (share_ok &&
+			    (shi2->frag_list ||
+			     (!skb2->head_frag && skb_headlen(skb)) ||
+			     skb->pp_recycle != skb2->pp_recycle ||
+			     skb_zcopy(skb2) ||
+			     (shi->nr_frags + shi2->nr_frags + 1 > MAX_SKB_FRAGS)))
+				share_ok = false;
+
+			/* do acct so we can free skb2 in share case */
+			skb->data_len += skb2->len;
+			skb->len += skb2->len;
+			remaining -= skb2->len;
+
+			trace_iptfs_ingress_nth_add(skb2, share_ok);
+
+			if (share_ok) {
+				iptfs_consume_frags(skb, skb2);
+			} else {
+				/* link on the frag_list */
+				*nextp = skb2;
+				nextp = &skb2->next;
+				BUG_ON(*nextp);
+				if (skb_has_frag_list(skb2))
+					nextp = iptfs_rehome_fraglist(nextp,
+								      skb2);
+				skb->truesize += skb2->truesize;
+			}
+		}
+
+		/* Consider fragmenting this skb2 that didn't fit. For demand
+		 * driven variable sized IPTFS pkts, though this isn't buying
+		 * a whole lot, especially if we are doing a copy which waiting
+		 * to send in a new pkt would not.
+		 */
+
+		xfrm_output(NULL, skb);
+	}
+}
+
+static enum hrtimer_restart iptfs_delay_timer(struct hrtimer *me)
+{
+	struct sk_buff_head list;
+	struct xfrm_iptfs_data *xtfs;
+	struct xfrm_state *x;
+	time64_t settime;
+	size_t osize;
+
+	xtfs = container_of(me, typeof(*xtfs), iptfs_timer);
+	x = xtfs->x;
+
+	/* Process all the queued packets
+	 *
+	 * softirq execution order: timer > tasklet > hrtimer
+	 *
+	 * Network rx will have run before us giving one last chance to queue
+	 * ingress packets for us to process and transmit.
+	 */
+
+	spin_lock(&x->lock);
+	__skb_queue_head_init(&list);
+	skb_queue_splice_init(&xtfs->queue, &list);
+	osize = xtfs->queue_size;
+	xtfs->queue_size = 0;
+	settime = xtfs->iptfs_settime;
+	spin_unlock(&x->lock);
+
+	/* After the above unlock, packets can begin queuing again, and the
+	 * timer can be set again, from another CPU either in softirq or user
+	 * context (not from this one since we are running at softirq level
+	 * already).
+	 */
+
+	trace_iptfs_timer_expire(xtfs,
+				 (unsigned long long)(ktime_get_raw_fast_ns() - settime));
+
+	iptfs_output_queued(x, &list);
+
+	return HRTIMER_NORESTART;
+}
+
+/**
+ * iptfs_encap_add_ipv4() - add outer encaps
+ *
+ * This was originally taken from xfrm4_tunnel_encap_add. The reason for the
+ * copy is that IP-TFS/AGGFRAG can have different functionality for how to set
+ * the TOS/DSCP bits. Sets the protocol to a different value and doesn't do
+ * anything with inner headers as they aren't pointing into a normal IP
+ * singleton inner packet.
+ */
+static int iptfs_encap_add_ipv4(struct xfrm_state *x, struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct iphdr *top_iph;
+	int flags;
+
+	skb_reset_inner_network_header(skb);
+	skb_reset_inner_transport_header(skb);
+
+	skb_set_network_header(skb, -(x->props.header_len - x->props.enc_hdr_len));
+	skb->mac_header = skb->network_header + offsetof(struct iphdr, protocol);
+	skb->transport_header = skb->network_header + sizeof(*top_iph);
+
+	top_iph = ip_hdr(skb);
+	top_iph->ihl = 5;
+	top_iph->version = 4;
+	top_iph->protocol = IPPROTO_AGGFRAG;
+
+	/* As we have 0, fractional, 1 or N inner packets there's no obviously
+	 * correct DSCP mapping to inherit. ECN should be cleared per RFC9347
+	 * 3.1.
+	 */
+	top_iph->tos = 0;
+
+	flags = x->props.flags;
+	top_iph->frag_off = htons(IP_DF);
+	top_iph->ttl = ip4_dst_hoplimit(xfrm_dst_child(dst));
+	top_iph->saddr = x->props.saddr.a4;
+	top_iph->daddr = x->id.daddr.a4;
+	ip_select_ident(dev_net(dst->dev), skb, NULL);
+
+	return 0;
+}
+
+/**
+ * iptfs_encap_add_ipv6() - add outer encaps
+ *
+ * This was originally taken from xfrm6_tunnel_encap_add. The reason for the
+ * copy is that IP-TFS/AGGFRAG can have different functionality for how to set
+ * the flow label and TOS/DSCP bits. It also sets the protocol to a different
+ * value and doesn't do anything with inner headers as they aren't pointing into
+ * a normal IP singleton inner packet.
+ */
+static int iptfs_encap_add_ipv6(struct xfrm_state *x, struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct ipv6hdr *top_iph;
+	int dsfield;
+
+	skb_reset_inner_network_header(skb);
+	skb_reset_inner_transport_header(skb);
+
+	skb_set_network_header(skb,
+			       -x->props.header_len + x->props.enc_hdr_len);
+	skb->mac_header = skb->network_header +
+		offsetof(struct ipv6hdr, nexthdr);
+	skb->transport_header = skb->network_header + sizeof(*top_iph);
+
+	top_iph = ipv6_hdr(skb);
+	top_iph->version = 6;
+	top_iph->priority = 0;
+	memset(top_iph->flow_lbl, 0, sizeof(top_iph->flow_lbl));
+	top_iph->nexthdr = IPPROTO_AGGFRAG;
+
+	/* As we have 0, fractional, 1 or N inner packets there's no obviously
+	 * correct DSCP mapping to inherit. ECN should be cleared per RFC9347
+	 * 3.1.
+	 */
+	dsfield = 0;
+	ipv6_change_dsfield(top_iph, 0, dsfield);
+
+	top_iph->hop_limit = ip6_dst_hoplimit(xfrm_dst_child(dst));
+	top_iph->saddr = *(struct in6_addr *)&x->props.saddr;
+	top_iph->daddr = *(struct in6_addr *)&x->id.daddr;
+
+	return 0;
+}
+
+/**
+ * iptfs_prepare_output() -  prepare the skb for output
+ *
+ * Return: Error value, if 0 then skb values should be as follows:
+ *    - transport_header should point at ESP header
+ *    - network_header should point at Outer IP header
+ *    - mac_header should point at protocol/nexthdr of the outer IP
+ */
+static int iptfs_prepare_output(struct xfrm_state *x, struct sk_buff *skb)
+{
+	if (x->outer_mode.family == AF_INET)
+		return iptfs_encap_add_ipv4(x, skb);
+	if (x->outer_mode.family == AF_INET6) {
+#if IS_ENABLED(CONFIG_IPV6)
+		return iptfs_encap_add_ipv6(x, skb);
+#else
+		WARN_ON_ONCE(1);
+		return -EAFNOSUPPORT;
+#endif
+	}
+	WARN_ON_ONCE(1);
+	return -EOPNOTSUPP;
+}
+
+/* ========================== */
+/* State Management Functions */
+/* ========================== */
+
+/**
+ * __iptfs_get_inner_mtu() - return inner MTU with no fragmentation.
+ */
+static u32 __iptfs_get_inner_mtu(struct xfrm_state *x, int outer_mtu)
+{
+	struct crypto_aead *aead;
+	u32 blksize;
+
+	aead = x->data;
+	blksize = ALIGN(crypto_aead_blocksize(aead), 4);
+	return ((outer_mtu - x->props.header_len - crypto_aead_authsize(aead)) &
+		~(blksize - 1)) - 2;
+}
+
+/**
+ * iptfs_get_mtu() - return the inner MTU for an IPTFS xfrm.
+ * @x:   XFRM state.
+ * @outer_mtu: Outer MTU for the encapsulated packet.
+ *
+ * Return: Correct MTU taking in to account the encap overhead.
+ */
+static u32 iptfs_get_inner_mtu(struct xfrm_state *x, int outer_mtu)
+{
+	struct xfrm_iptfs_data *xtfs = x->mode_data;
+
+	/* If not dont-frag we have no MTU */
+	if (!xtfs->cfg.dont_frag)
+		return x->outer_mode.family == AF_INET ? IP_MAX_MTU :
+							 IP6_MAX_MTU;
+	return __iptfs_get_inner_mtu(x, outer_mtu);
+}
+
+/**
+ * iptfs_user_init() - initialize the SA with IPTFS options from netlink.
+ */
+static int iptfs_user_init(struct net *net, struct xfrm_state *x,
+			   struct nlattr **attrs)
+{
+	struct xfrm_iptfs_data *xtfs = x->mode_data;
+	struct xfrm_iptfs_config *xc;
+
+	if (x->props.mode != XFRM_MODE_IPTFS)
+		return -EINVAL;
+
+	xc = &xtfs->cfg;
+	xc->reorder_win_size = net->xfrm.sysctl_iptfs_rewin;
+	xc->max_queue_size = net->xfrm.sysctl_iptfs_maxqsize;
+	xc->init_delay_us = net->xfrm.sysctl_iptfs_idelay;
+	xc->drop_time_us = net->xfrm.sysctl_iptfs_drptime;
+
+	if (attrs[XFRMA_IPTFS_DONT_FRAG])
+		xc->dont_frag = true;
+	if (attrs[XFRMA_IPTFS_REORD_WIN])
+		xc->reorder_win_size =
+			nla_get_u16(attrs[XFRMA_IPTFS_REORD_WIN]);
+	/* saved array is for saving 1..N seq nums from wantseq */
+	if (xc->reorder_win_size)
+		xtfs->w_saved = kcalloc(xc->reorder_win_size,
+					sizeof(*xtfs->w_saved), GFP_KERNEL);
+	if (attrs[XFRMA_IPTFS_PKT_SIZE]) {
+		xc->pkt_size = nla_get_u32(attrs[XFRMA_IPTFS_PKT_SIZE]);
+		if (!xc->pkt_size)
+			xtfs->payload_mtu = 0;
+		else if (xc->pkt_size > x->props.header_len)
+			xtfs->payload_mtu = xc->pkt_size - x->props.header_len;
+		else
+			return -EINVAL;
+	}
+	if (attrs[XFRMA_IPTFS_MAX_QSIZE])
+		xc->max_queue_size = nla_get_u32(attrs[XFRMA_IPTFS_MAX_QSIZE]);
+	if (attrs[XFRMA_IPTFS_DROP_TIME])
+		xc->drop_time_us = nla_get_u32(attrs[XFRMA_IPTFS_DROP_TIME]);
+	if (attrs[XFRMA_IPTFS_IN_DELAY])
+		xc->init_delay_us = nla_get_u32(attrs[XFRMA_IPTFS_IN_DELAY]);
+
+	xtfs->ecn_queue_size = (u64)xc->max_queue_size * 95 / 100;
+	xtfs->drop_time_ns = xc->drop_time_us * NSECS_IN_USEC;
+	xtfs->init_delay_ns = xc->init_delay_us * NSECS_IN_USEC;
+
+	return 0;
+}
+
+static int iptfs_copy_to_user(struct xfrm_state *x, struct sk_buff *skb)
+{
+	struct xfrm_iptfs_data *xtfs = x->mode_data;
+	struct xfrm_iptfs_config *xc = &xtfs->cfg;
+	int ret;
+
+	if (xc->dont_frag) {
+		ret = nla_put_flag(skb, XFRMA_IPTFS_DONT_FRAG);
+		if (ret)
+			return ret;
+	}
+	ret = nla_put_u16(skb, XFRMA_IPTFS_REORD_WIN, xc->reorder_win_size);
+	if (ret)
+		return ret;
+	ret = nla_put_u32(skb, XFRMA_IPTFS_PKT_SIZE, xc->pkt_size);
+	if (ret)
+		return ret;
+	ret = nla_put_u32(skb, XFRMA_IPTFS_MAX_QSIZE, xc->max_queue_size);
+	if (ret)
+		return ret;
+	ret = nla_put_u32(skb, XFRMA_IPTFS_DROP_TIME, xc->drop_time_us);
+	if (ret)
+		return ret;
+	ret = nla_put_u32(skb, XFRMA_IPTFS_IN_DELAY, xc->init_delay_us);
+	return ret;
+}
+
+static int iptfs_create_state(struct xfrm_state *x)
+{
+	struct xfrm_iptfs_data *xtfs;
+
+	xtfs = kzalloc(sizeof(*xtfs), GFP_KERNEL);
+	if (!xtfs)
+		return -ENOMEM;
+	x->mode_data = xtfs;
+
+	xtfs->x = x;
+
+	__skb_queue_head_init(&xtfs->queue);
+	xtfs->init_delay_ns = xtfs->cfg.init_delay_us * NSECS_IN_USEC;
+	hrtimer_init(&xtfs->iptfs_timer, CLOCK_MONOTONIC, IPTFS_HRTIMER_MODE);
+	xtfs->iptfs_timer.function = iptfs_delay_timer;
+
+	xtfs->drop_time_ns = xtfs->cfg.drop_time_us * NSECS_IN_USEC;
+	spin_lock_init(&xtfs->drop_lock);
+	hrtimer_init(&xtfs->drop_timer, CLOCK_MONOTONIC, IPTFS_HRTIMER_MODE);
+	xtfs->drop_timer.function = iptfs_drop_timer;
+
+	/* Modify type (esp) adjustment values */
+
+	if (x->props.family == AF_INET)
+		x->props.header_len += sizeof(struct iphdr) + sizeof(struct ip_iptfs_hdr);
+	else if (x->props.family == AF_INET6)
+		x->props.header_len += sizeof(struct ipv6hdr) + sizeof(struct ip_iptfs_hdr);
+	x->props.enc_hdr_len = sizeof(struct ip_iptfs_hdr);
+
+	return 0;
+}
+
+static void iptfs_delete_state(struct xfrm_state *x)
+{
+	struct xfrm_iptfs_data *xtfs = x->mode_data;
+
+	if (IS_ERR_OR_NULL(xtfs))
+		return;
+
+	spin_lock(&xtfs->drop_lock);
+	hrtimer_cancel(&xtfs->iptfs_timer);
+	hrtimer_cancel(&xtfs->drop_timer);
+	spin_unlock(&xtfs->drop_lock);
+
+	kfree_sensitive(xtfs->w_saved);
+	kfree_sensitive(xtfs);
+}
+
+static const struct xfrm_mode_cbs iptfs_mode_cbs = {
+	.owner = THIS_MODULE,
+	.create_state = iptfs_create_state,
+	.delete_state = iptfs_delete_state,
+	.user_init = iptfs_user_init,
+	.copy_to_user = iptfs_copy_to_user,
+	.get_inner_mtu = iptfs_get_inner_mtu,
+	.input = iptfs_input,
+	.output = iptfs_output_collect,
+	.prepare_output = iptfs_prepare_output,
+};
+
+static int __init xfrm_iptfs_init(void)
+{
+	int err;
+
+	pr_info("xfrm_iptfs: IPsec IP-TFS tunnel mode module\n");
+
+	err = xfrm_register_mode_cbs(XFRM_MODE_IPTFS, &iptfs_mode_cbs);
+	if (err < 0)
+		pr_info("%s: can't register IP-TFS\n", __func__);
+
+	return err;
+}
+
+static void __exit xfrm_iptfs_fini(void)
+{
+	xfrm_unregister_mode_cbs(XFRM_MODE_IPTFS);
+}
+
+module_init(xfrm_iptfs_init);
+module_exit(xfrm_iptfs_fini);
+MODULE_LICENSE("GPL");
-- 
2.42.0


^ permalink raw reply related

* [RFC ipsec-next 1/8] iptfs: config: add CONFIG_XFRM_IPTFS
From: Christian Hopps @ 2023-11-10 11:37 UTC (permalink / raw)
  To: devel; +Cc: Steffen Klassert, netdev, Christian Hopps
In-Reply-To: <20231110113719.3055788-1-chopps@chopps.org>

From: Christian Hopps <chopps@labn.net>

Signed-off-by: Christian Hopps <chopps@labn.net>
---
 net/xfrm/Kconfig  | 9 +++++++++
 net/xfrm/Makefile | 1 +
 2 files changed, 10 insertions(+)

diff --git a/net/xfrm/Kconfig b/net/xfrm/Kconfig
index 3adf31a83a79..d07852069e68 100644
--- a/net/xfrm/Kconfig
+++ b/net/xfrm/Kconfig
@@ -134,6 +134,15 @@ config NET_KEY_MIGRATE
 
 	  If unsure, say N.
 
+config XFRM_IPTFS
+	bool "IPsec IPTFS (RFC 9347) encapsulation support"
+	depends on XFRM
+	help
+	  Information on the IPTFS encapsulation can be found
+          in RFC 9347.
+
+          If unsure, say N.
+
 config XFRM_ESPINTCP
 	bool
 
diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
index cd47f88921f5..9b870a3274a7 100644
--- a/net/xfrm/Makefile
+++ b/net/xfrm/Makefile
@@ -20,4 +20,5 @@ obj-$(CONFIG_XFRM_USER) += xfrm_user.o
 obj-$(CONFIG_XFRM_USER_COMPAT) += xfrm_compat.o
 obj-$(CONFIG_XFRM_IPCOMP) += xfrm_ipcomp.o
 obj-$(CONFIG_XFRM_INTERFACE) += xfrm_interface.o
+obj-$(CONFIG_XFRM_IPTFS) += xfrm_iptfs.o
 obj-$(CONFIG_XFRM_ESPINTCP) += espintcp.o
-- 
2.42.0


^ permalink raw reply related

* [PATCH net] net: mdio: fix typo in header
From: Marek Behún @ 2023-11-10 12:05 UTC (permalink / raw)
  To: netdev; +Cc: Andrew Lunn, Jakub Kicinski, Marek Behún

The quotes symbol in
  "EEE "link partner ability 1
should be at the end of the register name
  "EEE link partner ability 1"

Signed-off-by: Marek Behún <kabel@kernel.org>
---
 include/linux/mdio.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mdio.h b/include/linux/mdio.h
index 8fa23bdcedbf..007fd9c3e4b6 100644
--- a/include/linux/mdio.h
+++ b/include/linux/mdio.h
@@ -420,7 +420,7 @@ static inline u32 linkmode_adv_to_mii_t1_adv_m_t(unsigned long *advertising)
  * A function that translates value of following registers to the linkmode:
  * IEEE 802.3-2018 45.2.3.10 "EEE control and capability 1" register (3.20)
  * IEEE 802.3-2018 45.2.7.13 "EEE advertisement 1" register (7.60)
- * IEEE 802.3-2018 45.2.7.14 "EEE "link partner ability 1 register (7.61)
+ * IEEE 802.3-2018 45.2.7.14 "EEE link partner ability 1" register (7.61)
  */
 static inline void mii_eee_cap1_mod_linkmode_t(unsigned long *adv, u32 val)
 {
-- 
2.41.0


^ permalink raw reply related

* Re: [PATCH 04/22] [RESEND] time: make sysfs_get_uname() function visible in header
From: Uwe Kleine-König @ 2023-11-10 12:41 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Andrew Morton, linux-kernel, Masahiro Yamada, linux-kbuild,
	Arnd Bergmann, Matt Turner, Vineet Gupta, Russell King,
	Catalin Marinas, Will Deacon, Steven Rostedt, Masami Hiramatsu,
	Mark Rutland, Guo Ren, Peter Zijlstra, Ard Biesheuvel,
	Huacai Chen, Greg Ungerer, Michal Simek, Thomas Bogendoerfer,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Geoff Levand, Palmer Dabbelt, Heiko Carstens,
	John Paul Adrian Glaubitz, David S. Miller, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, x86, Helge Deller, Sudip Mukherjee,
	Greg Kroah-Hartman, Timur Tabi, Kent Overstreet, David Woodhouse,
	Naveen N. Rao, Anil S Keshavamurthy, Kees Cook, Vincenzo Frascino,
	Juri Lelli, Vincent Guittot, Nathan Chancellor, Nick Desaulniers,
	Nicolas Schier, Al Viro, linux-alpha, linux-snps-arc,
	linux-arm-kernel, linux-trace-kernel, linux-csky, loongarch,
	linux-m68k, linux-mips, linuxppc-dev, linux-riscv, linux-s390,
	linux-sh, sparclinux, netdev, linux-parisc, linux-usb,
	linux-fbdev, dri-devel, linux-bcachefs, linux-mtd
In-Reply-To: <20231108125843.3806765-5-arnd@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 919 bytes --]

Hello Arnd,

On Wed, Nov 08, 2023 at 01:58:25PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> This function is defined globally in clocksource.c and used conditionally
> in clockevent.c, which the declaration hidden when clockevent support

s/which/with/ ?

> is disabled. This causes a harmless warning in the definition:
> 
> kernel/time/clocksource.c:1324:9: warning: no previous prototype for 'sysfs_get_uname' [-Wmissing-prototypes]
>  1324 | ssize_t sysfs_get_uname(const char *buf, char *dst, size_t cnt)
> 
> Move the declaration out of the #ifdef so it is always visible.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Other than that:

Reviewed-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox