Netdev List

* Re: [PATCH] tcp: md5: incorrect tcp_header_len for incoming connections
From: Francesco Ruggeri @ 2022-04-21  0:32 UTC
  To: Eric Dumazet
  Cc: Paolo Abeni, Jakub Kicinski, David Ahern, Hideaki YOSHIFUJI,
	David Miller, LKML, netdev
In-Reply-To: <CANn89iJjwV2gAKMc4iydUt_MqtnB-4_EKdVrqQO9q4Dt17Lf9w@mail.gmail.com>

On Wed, Apr 20, 2022 at 5:20 PM Eric Dumazet <edumazet@google.com> wrote:
> On Wed, Apr 20, 2022 at 4:57 PM Francesco Ruggeri <fruggeri@arista.com> wrote:
> This seems like a day-0 bug, right?
>
> Do you agree on adding
>
> Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")
>
> Thanks.
>
I also think it is a day-0 bug. Should I resubmit with a "Fixes:" tag?

* Re: linux-next: build failure after merge of the bpf-next tree
From: Stephen Rothwell @ 2022-04-21  0:32 UTC
  To: Daniel Borkmann, Alexei Starovoitov, Andrii Nakryiko
  Cc: bpf, Networking, Maciej Fijalkowski, Linux Kernel Mailing List,
	Linux Next Mailing List
In-Reply-To: <20220419115620.65580586@canb.auug.org.au>

Hi all,

On Tue, 19 Apr 2022 11:56:20 +1000 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> After merging the bpf-next tree, today's linux-next build
> (x86_64 allmodconfig) failed like this:
> 
> In file included from include/linux/compiler_types.h:73,
>                  from <command-line>:
> drivers/net/ethernet/intel/i40e/i40e_xsk.c: In function 'i40e_run_xdp_zc':
> include/linux/compiler_attributes.h:222:41: error: attribute 'fallthrough' not preceding a case label or default label [-Werror]
>   222 | # define fallthrough                    __attribute__((__fallthrough__))
>       |                                         ^~~~~~~~~~~~~
> drivers/net/ethernet/intel/i40e/i40e_xsk.c:192:17: note: in expansion of macro 'fallthrough'
>   192 |                 fallthrough; /* handle aborts by dropping packet */
>       |                 ^~~~~~~~~~~
> cc1: all warnings being treated as errors
> In file included from include/linux/compiler_types.h:73,
>                  from <command-line>:
> drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c: In function 'ixgbe_run_xdp_zc':
> include/linux/compiler_attributes.h:222:41: error: attribute 'fallthrough' not preceding a case label or default label [-Werror]
>   222 | # define fallthrough                    __attribute__((__fallthrough__))
>       |                                         ^~~~~~~~~~~~~
> drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c:147:17: note: in expansion of macro 'fallthrough'
>   147 |                 fallthrough; /* handle aborts by dropping packet */
>       |                 ^~~~~~~~~~~
> cc1: all warnings being treated as errors
> 
> Caused by commits
> 
>   b8aef650e549 ("i40e, xsk: Terminate Rx side of NAPI when XSK Rx queue gets full")
>   c7dd09fd4628 ("ixgbe, xsk: Terminate Rx side of NAPI when XSK Rx queue gets full")
> 
> I have used the bpf-next tree from next-20220414 for today.

I am still getting these failures ...

-- 
Cheers,
Stephen Rothwell

* Re: [PATCH] tcp: md5: incorrect tcp_header_len for incoming connections
From: Eric Dumazet @ 2022-04-21  0:20 UTC
  To: Francesco Ruggeri
  Cc: Paolo Abeni, Jakub Kicinski, David Ahern, Hideaki YOSHIFUJI,
	David Miller, LKML, netdev
In-Reply-To: <20220420235659.830155EC021C@us226.sjc.aristanetworks.com>

On Wed, Apr 20, 2022 at 4:57 PM Francesco Ruggeri <fruggeri@arista.com> wrote:
>
> In tcp_create_openreq_child we adjust tcp_header_len for md5 using the
> remote address in newsk. But that address is still 0 in newsk at this
> point, and it is only set later by the callers (tcp_v[46]_syn_recv_sock).
> Use the address from the request socket instead.
>

Nice catch.

This seems like a day-0 bug, right?

Do you agree on adding

Fixes: cfb6eeb4c860 ("[TCP]: MD5 Signature Option (RFC2385) support.")

Thanks.

> Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
> ---
>  net/ipv4/tcp_minisocks.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 6366df7aaf2a..6854bb1fb32b 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -531,7 +531,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
>         newtp->tsoffset = treq->ts_off;
>  #ifdef CONFIG_TCP_MD5SIG
>         newtp->md5sig_info = NULL;      /*XXX*/
> -       if (newtp->af_specific->md5_lookup(sk, newsk))
> +       if (treq->af_specific->req_md5_lookup(sk, req_to_sk(req)))
>                 newtp->tcp_header_len += TCPOLEN_MD5SIG_ALIGNED;
>  #endif
>         if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len)
> --
> 2.28.0
>
>

* [PATCH] tcp: md5: incorrect tcp_header_len for incoming connections
From: Francesco Ruggeri @ 2022-04-20 23:56 UTC
  To: pabeni, kuba, dsahern, yoshfuji, davem, edumazet, linux-kernel,
	netdev, fruggeri

In tcp_create_openreq_child we adjust tcp_header_len for md5 using the
remote address in newsk. But that address is still 0 in newsk at this
point, and it is only set later by the callers (tcp_v[46]_syn_recv_sock).
Use the address from the request socket instead.

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
---
 net/ipv4/tcp_minisocks.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 6366df7aaf2a..6854bb1fb32b 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -531,7 +531,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 	newtp->tsoffset = treq->ts_off;
 #ifdef CONFIG_TCP_MD5SIG
 	newtp->md5sig_info = NULL;	/*XXX*/
-	if (newtp->af_specific->md5_lookup(sk, newsk))
+	if (treq->af_specific->req_md5_lookup(sk, req_to_sk(req)))
 		newtp->tcp_header_len += TCPOLEN_MD5SIG_ALIGNED;
 #endif
 	if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len)
-- 
2.28.0
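
Editor's illustration (not part of the patch): the ordering problem
described in the commit message can be modelled in userspace. This is a
minimal sketch, not kernel code. struct model_sock, struct model_req,
model_md5_lookup() and MODEL_MD5_PEER are hypothetical stand-ins for
newsk, the request socket, the MD5 key lookup and a peer that has a key
configured. Only the value of TCPOLEN_MD5SIG_ALIGNED mirrors the kernel
definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TCPOLEN_MD5SIG_ALIGNED 20          /* same value as the kernel define */
#define MODEL_BASE_HEADER_LEN  20          /* TCP header without options */
#define MODEL_MD5_PEER         0x0a000001u /* hypothetical peer with an MD5 key */

/* Hypothetical stand-ins for the request socket and the child socket. */
struct model_req  { uint32_t rmt_addr; };
struct model_sock { uint32_t daddr; int tcp_header_len; };

/* Hypothetical key lookup: only MODEL_MD5_PEER has a key configured. */
bool model_md5_lookup(uint32_t peer_addr)
{
	return peer_addr == MODEL_MD5_PEER;
}

/* Before the patch: the lookup uses the child's address, which is still 0. */
int child_header_len_before(const struct model_req *req)
{
	struct model_sock newsk = { .daddr = 0,
				    .tcp_header_len = MODEL_BASE_HEADER_LEN };

	if (model_md5_lookup(newsk.daddr))
		newsk.tcp_header_len += TCPOLEN_MD5SIG_ALIGNED;
	newsk.daddr = req->rmt_addr;	/* the caller sets this only afterwards */
	return newsk.tcp_header_len;
}

/* After the patch: the lookup uses the address stored in the request. */
int child_header_len_after(const struct model_req *req)
{
	struct model_sock newsk = { .daddr = 0,
				    .tcp_header_len = MODEL_BASE_HEADER_LEN };

	if (model_md5_lookup(req->rmt_addr))
		newsk.tcp_header_len += TCPOLEN_MD5SIG_ALIGNED;
	newsk.daddr = req->rmt_addr;
	return newsk.tcp_header_len;
}
```

In this model, a request from MODEL_MD5_PEER yields a header length of
20 with the old ordering and 40 with the new one. A peer without a key
yields 20 in both cases.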



* Re: [PATCH 4.19 000/338] 4.19.238-rc1 review
From: Naresh Kamboju @ 2022-04-20 23:48 UTC
  To: Greg Kroah-Hartman
  Cc: linux-kernel, stable, torvalds, akpm, linux, shuah, patches,
	lkft-triage, pavel, jonathanh, f.fainelli, sudipm.mukherjee,
	slade, Netdev, David S. Miller, Jakub Kicinski, Paolo Abeni,
	NeilBrown, Trond Myklebust, linux-nfs, Anna Schumaker
In-Reply-To: <CA+G9fYvgzFW7sMZVdw5r970QNNg4OK8=pbQV0kDfbOX-rXu5Rw@mail.gmail.com>

On Mon, 18 Apr 2022 at 14:09, Naresh Kamboju <naresh.kamboju@linaro.org> wrote:
>
> On Thu, 14 Apr 2022 at 18:45, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> >
> > This is the start of the stable review cycle for the 4.19.238 release.
> > There are 338 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Sat, 16 Apr 2022 11:07:54 +0000.
> > Anything received after that time might be too late.
> >
> > The whole patch series can be found in one patch at:
> >         https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.19.238-rc1.gz
> > or in the git tree and branch at:
> >         git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git linux-4.19.y
> > and the diffstat can be found below.
> >
> > thanks,
> >
> > greg k-h
>
>
> The following kernel warning was noticed on arm64 Juno-r2 while booting
> stable-rc 4.19.238. Here is the full test log link [1].
>
> [    0.000000] Booting Linux on physical CPU 0x0000000100 [0x410fd033]
> [    0.000000] Linux version 4.19.238 (tuxmake@tuxmake) (gcc version 11.2.0 (Debian 11.2.0-18)) #1 SMP PREEMPT @1650206156
> [    0.000000] Machine model: ARM Juno development board (r2)
> <trim>
> [   18.499895] ================================
> [   18.504172] WARNING: inconsistent lock state
> [   18.508451] 4.19.238 #1 Not tainted
> [   18.511944] --------------------------------
> [   18.516222] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
> [   18.522242] kworker/u12:3/60 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [   18.527826] (____ptrval____) (&(&xprt->transport_lock)->rlock){+.?.}, at: xprt_destroy+0x70/0xe0
> [   18.536648] {IN-SOFTIRQ-W} state was registered at:
> [   18.541543]   lock_acquire+0xc8/0x23c
> [   18.545216]   _raw_spin_lock+0x50/0x64
> [   18.548973]   xs_tcp_state_change+0x1b4/0x440
> [   18.553343]   tcp_rcv_state_process+0x684/0x1300
> [   18.557972]   tcp_v4_do_rcv+0x70/0x290
> [   18.561731]   tcp_v4_rcv+0xc34/0xda0
> [   18.565316]   ip_local_deliver_finish+0x16c/0x3c0
> [   18.570032]   ip_local_deliver+0x6c/0x240
> [   18.574051]   ip_rcv_finish+0x98/0xe4
> [   18.577722]   ip_rcv+0x68/0x210
> [   18.580871]   __netif_receive_skb_one_core+0x6c/0x9c
> [   18.585847]   __netif_receive_skb+0x2c/0x74
> [   18.590039]   netif_receive_skb_internal+0x88/0x20c
> [   18.594928]   netif_receive_skb+0x68/0x1a0
> [   18.599036]   smsc911x_poll+0x104/0x290
> [   18.602881]   net_rx_action+0x124/0x4bc
> [   18.606727]   __do_softirq+0x1d0/0x524
> [   18.610484]   irq_exit+0x11c/0x144
> [   18.613894]   __handle_domain_irq+0x84/0xe0
> [   18.618086]   gic_handle_irq+0x5c/0xb0
> [   18.621843]   el1_irq+0xb4/0x130
> [   18.625081]   cpuidle_enter_state+0xc0/0x3ec
> [   18.629361]   cpuidle_enter+0x38/0x4c
> [   18.633032]   do_idle+0x200/0x2c0
> [   18.636353]   cpu_startup_entry+0x30/0x50
> [   18.640372]   rest_init+0x260/0x270
> [   18.643870]   start_kernel+0x45c/0x490
> [   18.647625] irq event stamp: 18931
> [   18.651037] hardirqs last  enabled at (18931): [<ffff00000832e800>] kfree+0xe0/0x370
> [   18.658799] hardirqs last disabled at (18930): [<ffff00000832e7ec>] kfree+0xcc/0x370
> [   18.666564] softirqs last  enabled at (18920): [<ffff000008fbce94>] rpc_wake_up_first_on_wq+0xb4/0x1b0
> [   18.675893] softirqs last disabled at (18918): [<ffff000008fbce18>] rpc_wake_up_first_on_wq+0x38/0x1b0
> [   18.685217]
> [   18.685217] other info that might help us debug this:
> [   18.691758]  Possible unsafe locking scenario:
> [   18.691758]
> [   18.697689]        CPU0
> [   18.700137]        ----
> [   18.702586]   lock(&(&xprt->transport_lock)->rlock);
> [   18.707562]   <Interrupt>
> [   18.710184]     lock(&(&xprt->transport_lock)->rlock);
> [   18.715335]
> [   18.715335]  *** DEADLOCK ***

My bisect script pointed to the following kernel commit:

BAT BISECTION OLD: This iteration (kernel rev
2d235d26dcf81d34c93ba8616d75c804b5ee5f3f) presents old behavior.
242a3e0c75b64b4ced82e29e07a6d6d98eeec826 is the first new commit
commit 242a3e0c75b64b4ced82e29e07a6d6d98eeec826
Author: NeilBrown <neilb@suse.de>
Date:   Tue Mar 8 13:42:17 2022 +1100

    SUNRPC: avoid race between mod_timer() and del_timer_sync()

    commit 3848e96edf4788f772d83990022fa7023a233d83 upstream.

    xprt_destroy() claims XPRT_LOCKED and then calls del_timer_sync().
    Both xprt_unlock_connect() and xprt_release() call
     ->release_xprt()
    which drops XPRT_LOCKED and *then* xprt_schedule_autodisconnect()
    which calls mod_timer().

    This may result in mod_timer() being called *after* del_timer_sync().
    When this happens, the timer may fire long after the xprt has been freed,
    and run_timer_softirq() will probably crash.

    The pairing of ->release_xprt() and xprt_schedule_autodisconnect() is
    always called under ->transport_lock.  So if we take ->transport_lock to
    call del_timer_sync(), we can be sure that mod_timer() will run first
    (if it runs at all).

    Cc: stable@vger.kernel.org
    Signed-off-by: NeilBrown <neilb@suse.de>
    Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 net/sunrpc/xprt.c | 7 +++++++
 1 file changed, 7 insertions(+)

Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>

 --
Linaro LKFT
https://lkft.linaro.org

* Re: [PATCH bpf-next v2 2/6] ftrace: Fix deadloop caused by direct call in ftrace selftest
From: Steven Rostedt @ 2022-04-20 23:24 UTC
  To: Xu Kuohai
  Cc: bpf, linux-arm-kernel, linux-kernel, netdev, linux-kselftest,
	Catalin Marinas, Will Deacon, Ingo Molnar, Daniel Borkmann,
	Alexei Starovoitov, Zi Shen Lim, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, David S . Miller, Hideaki YOSHIFUJI, David Ahern,
	Thomas Gleixner, Borislav Petkov, Dave Hansen, x86, hpa,
	Shuah Khan, Mark Rutland, Ard Biesheuvel, Pasha Tatashin,
	Peter Collingbourne, Daniel Kiss, Sudeep Holla, Steven Price,
	Marc Zyngier, Mark Brown, Kumar Kartikeya Dwivedi,
	Delyan Kratunov
In-Reply-To: <20220414162220.1985095-3-xukuohai@huawei.com>

On Thu, 14 Apr 2022 12:22:16 -0400
Xu Kuohai <xukuohai@huawei.com> wrote:

> After direct call is enabled for arm64, ftrace selftest enters a
> dead loop:
> 
> <trace_selftest_dynamic_test_func>:
> 00  bti     c
> 01  mov     x9, x30                            <trace_direct_tramp>:
> 02  bl      <trace_direct_tramp>    ---------->     ret
>                                                      |
>                                          lr/x30 is 03, return to 03
>                                                      |
> 03  mov     w0, #0x0   <-----------------------------|
>      |                                               |
>      |                   dead loop!                  |
>      |                                               |
> 04  ret   ---- lr/x30 is still 03, go back to 03 ----|
> 
> The reason is that when the direct caller trace_direct_tramp() returns
> to the patched function trace_selftest_dynamic_test_func(), lr is still
> the address after the instrumented instruction in the patched function,
> so when the patched function exits, it returns to itself!
> 
> To fix this issue, we need to restore lr before trace_direct_tramp()
> exits, so make trace_direct_tramp() a weak symbol and rewrite it for
> arm64.
> 
> To detect this issue directly, call DYN_FTRACE_TEST_NAME() before
> register_ftrace_graph().
> 
> Reported-by: Li Huafei <lihuafei1@huawei.com>
> Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
> ---
>  arch/arm64/kernel/entry-ftrace.S | 10 ++++++++++
>  kernel/trace/trace_selftest.c    |  4 +++-
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
> index dfe62c55e3a2..e58eb06ec9b2 100644
> --- a/arch/arm64/kernel/entry-ftrace.S
> +++ b/arch/arm64/kernel/entry-ftrace.S
> @@ -357,3 +357,13 @@ SYM_CODE_START(return_to_handler)
>  	ret
>  SYM_CODE_END(return_to_handler)
>  #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
> +
> +#ifdef CONFIG_FTRACE_SELFTEST
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> +SYM_FUNC_START(trace_direct_tramp)
> +	mov	x10, x30
> +	mov	x30, x9
> +	ret	x10
> +SYM_FUNC_END(trace_direct_tramp)
> +#endif /* CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS */
> +#endif /* CONFIG_FTRACE_SELFTEST */
> diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
> index abcadbe933bb..38b0d5c9a1e0 100644
> --- a/kernel/trace/trace_selftest.c
> +++ b/kernel/trace/trace_selftest.c
> @@ -785,7 +785,7 @@ static struct fgraph_ops fgraph_ops __initdata  = {
>  };
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS
> -noinline __noclone static void trace_direct_tramp(void) { }
> +void __weak trace_direct_tramp(void) { }
>  #endif
>  
>  /*


> @@ -868,6 +868,8 @@ trace_selftest_startup_function_graph(struct tracer *trace,
>  	if (ret)
>  		goto out;
>  
> +	DYN_FTRACE_TEST_NAME();

This doesn't look like it belongs in this patch.

-- Steve

> +
>  	ret = register_ftrace_graph(&fgraph_ops);
>  	if (ret) {
>  		warn_failed_init_tracer(trace, ret);
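
Editor's illustration (not part of the thread): the dead loop from the
quoted commit message can be reproduced with a toy register model. This
is a sketch under stated assumptions. The "addresses" 0 to 4 are the
instruction slots from the diagram, CALLER_ADDR is a made-up return
address, and patched_func_returns() and enum tramp_kind are
hypothetical names. The TRAMP_RESTORE_LR branch follows the three
instructions of the arm64 trace_direct_tramp added by the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define CALLER_ADDR 100	/* made-up address in the caller of the patched function */

enum tramp_kind {
	TRAMP_PLAIN_RET,	/* empty C trampoline: just "ret" */
	TRAMP_RESTORE_LR	/* arm64 version from the patch: restores lr from x9 */
};

/*
 * Step through the patched function from the diagram. Returns true if
 * control reaches CALLER_ADDR within max_steps, false if it keeps looping.
 */
bool patched_func_returns(enum tramp_kind tramp, int max_steps)
{
	int x30 = CALLER_ADDR;	/* lr on entry */
	int x9 = 0, x10 = 0;
	int pc = 0;

	for (int step = 0; step < max_steps; step++) {
		switch (pc) {
		case 0:			/* bti c */
			pc = 1;
			break;
		case 1:			/* mov x9, x30 */
			x9 = x30;
			pc = 2;
			break;
		case 2:			/* bl <trace_direct_tramp> */
			x30 = 3;	/* bl stores the return address in lr */
			if (tramp == TRAMP_PLAIN_RET) {
				pc = x30;	/* ret */
			} else {
				x10 = x30;	/* mov x10, x30 */
				x30 = x9;	/* mov x30, x9 */
				pc = x10;	/* ret x10 */
			}
			break;
		case 3:			/* mov w0, #0x0 */
			pc = 4;
			break;
		case 4:			/* ret */
			pc = x30;
			break;
		default:
			return pc == CALLER_ADDR;
		}
	}
	return false;	/* still inside the function: dead loop */
}
```

With TRAMP_PLAIN_RET the model bounces between slots 3 and 4 until the
step budget runs out. With TRAMP_RESTORE_LR it leaves through
CALLER_ADDR.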


* [PATCH net-next 3/3] ipv4: Initialise ->flowi4_scope properly in ICMP handlers.
From: Guillaume Nault @ 2022-04-20 23:21 UTC
  To: David Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern
In-Reply-To: <cover.1650470610.git.gnault@redhat.com>

All the *_redirect() and *_update_pmtu() functions initialise their
struct flowi4 variable with either __build_flow_key() or
build_sk_flow_key(). When sk is provided, these functions use
RT_CONN_FLAGS() to set ->flowi4_tos and always use RT_SCOPE_UNIVERSE
for ->flowi4_scope. Then they rely on ip_rt_fix_tos() to adjust the
scope based on the RTO_ONLINK bit and to mask the tos with
IPTOS_RT_MASK.

This patch modifies __build_flow_key() and build_sk_flow_key() to
properly initialise ->flowi4_tos and ->flowi4_scope, so that the
ICMP redirects and PMTU handlers don't need an extra call to
ip_rt_fix_tos() before doing a fib lookup. That is, we:

  * Drop RT_CONN_FLAGS(): use ip_sock_rt_tos() and ip_sock_rt_scope()
    instead, so that we don't have to rely on ip_rt_fix_tos() to adjust
    the scope anymore.

  * Apply IPTOS_RT_MASK to the tos, so that we don't need
    ip_rt_fix_tos() to do it for us.

  * Drop the ip_rt_fix_tos() calls that now become useless.

The only remaining ip_rt_fix_tos() caller is ip_route_output_key_hash()
which needs it as long as external callers still use the RTO_ONLINK
flag.

Note:
  This patch also drops some useless RT_TOS() calls as IPTOS_RT_MASK is
  a stronger mask.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
 net/ipv4/route.c | 37 +++++++++++++++++--------------------
 1 file changed, 17 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index d8f82c0ac132..ffbe2e4f8c89 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -508,23 +508,24 @@ static void ip_rt_fix_tos(struct flowi4 *fl4)
 }
 
 static void __build_flow_key(const struct net *net, struct flowi4 *fl4,
-			     const struct sock *sk,
-			     const struct iphdr *iph,
-			     int oif, u8 tos,
-			     u8 prot, u32 mark, int flow_flags)
+			     const struct sock *sk, const struct iphdr *iph,
+			     int oif, __u8 tos, u8 prot, u32 mark,
+			     int flow_flags)
 {
+	__u8 scope = RT_SCOPE_UNIVERSE;
+
 	if (sk) {
 		const struct inet_sock *inet = inet_sk(sk);
 
 		oif = sk->sk_bound_dev_if;
 		mark = sk->sk_mark;
-		tos = RT_CONN_FLAGS(sk);
+		tos = ip_sock_rt_tos(sk);
+		scope = ip_sock_rt_scope(sk);
 		prot = inet->hdrincl ? IPPROTO_RAW : sk->sk_protocol;
 	}
-	flowi4_init_output(fl4, oif, mark, tos,
-			   RT_SCOPE_UNIVERSE, prot,
-			   flow_flags,
-			   iph->daddr, iph->saddr, 0, 0,
+
+	flowi4_init_output(fl4, oif, mark, tos & IPTOS_RT_MASK, scope,
+			   prot, flow_flags, iph->daddr, iph->saddr, 0, 0,
 			   sock_net_uid(net, sk));
 }
 
@@ -534,9 +535,9 @@ static void build_skb_flow_key(struct flowi4 *fl4, const struct sk_buff *skb,
 	const struct net *net = dev_net(skb->dev);
 	const struct iphdr *iph = ip_hdr(skb);
 	int oif = skb->dev->ifindex;
-	u8 tos = RT_TOS(iph->tos);
 	u8 prot = iph->protocol;
 	u32 mark = skb->mark;
+	__u8 tos = iph->tos;
 
 	__build_flow_key(net, fl4, sk, iph, oif, tos, prot, mark, 0);
 }
@@ -552,7 +553,8 @@ static void build_sk_flow_key(struct flowi4 *fl4, const struct sock *sk)
 	if (inet_opt && inet_opt->opt.srr)
 		daddr = inet_opt->opt.faddr;
 	flowi4_init_output(fl4, sk->sk_bound_dev_if, sk->sk_mark,
-			   RT_CONN_FLAGS(sk), RT_SCOPE_UNIVERSE,
+			   ip_sock_rt_tos(sk) & IPTOS_RT_MASK,
+			   ip_sock_rt_scope(sk),
 			   inet->hdrincl ? IPPROTO_RAW : sk->sk_protocol,
 			   inet_sk_flowi_flags(sk),
 			   daddr, inet->inet_saddr, 0, 0, sk->sk_uid);
@@ -825,14 +827,13 @@ static void ip_do_redirect(struct dst_entry *dst, struct sock *sk, struct sk_buf
 	const struct iphdr *iph = (const struct iphdr *) skb->data;
 	struct net *net = dev_net(skb->dev);
 	int oif = skb->dev->ifindex;
-	u8 tos = RT_TOS(iph->tos);
 	u8 prot = iph->protocol;
 	u32 mark = skb->mark;
+	__u8 tos = iph->tos;
 
 	rt = (struct rtable *) dst;
 
 	__build_flow_key(net, &fl4, sk, iph, oif, tos, prot, mark, 0);
-	ip_rt_fix_tos(&fl4);
 	__ip_do_redirect(rt, skb, &fl4, true);
 }
 
@@ -1061,7 +1062,6 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
 	struct flowi4 fl4;
 
 	ip_rt_build_flow_key(&fl4, sk, skb);
-	ip_rt_fix_tos(&fl4);
 
 	/* Don't make lookup fail for bridged encapsulations */
 	if (skb && netif_is_any_bridge_port(skb->dev))
@@ -1078,8 +1078,8 @@ void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu,
 	struct rtable *rt;
 	u32 mark = IP4_REPLY_MARK(net, skb->mark);
 
-	__build_flow_key(net, &fl4, NULL, iph, oif,
-			 RT_TOS(iph->tos), protocol, mark, 0);
+	__build_flow_key(net, &fl4, NULL, iph, oif, iph->tos, protocol, mark,
+			 0);
 	rt = __ip_route_output_key(net, &fl4);
 	if (!IS_ERR(rt)) {
 		__ip_rt_update_pmtu(rt, &fl4, mtu);
@@ -1136,8 +1136,6 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)
 			goto out;
 
 		new = true;
-	} else {
-		ip_rt_fix_tos(&fl4);
 	}
 
 	__ip_rt_update_pmtu((struct rtable *)xfrm_dst_path(&rt->dst), &fl4, mtu);
@@ -1169,8 +1167,7 @@ void ipv4_redirect(struct sk_buff *skb, struct net *net,
 	struct flowi4 fl4;
 	struct rtable *rt;
 
-	__build_flow_key(net, &fl4, NULL, iph, oif,
-			 RT_TOS(iph->tos), protocol, 0, 0);
+	__build_flow_key(net, &fl4, NULL, iph, oif, iph->tos, protocol, 0, 0);
 	rt = __ip_route_output_key(net, &fl4);
 	if (!IS_ERR(rt)) {
 		__ip_do_redirect(rt, skb, &fl4, false);
-- 
2.21.3
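
Editor's illustration (not part of the patch): the note that
IPTOS_RT_MASK is a stronger mask than RT_TOS() can be checked
exhaustively over all 256 TOS values. This is a small userspace sketch.
The mask values are copied from the kernel headers, and
rt_tos_is_redundant() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>

#define IPTOS_TOS_MASK	0x1E
#define IPTOS_RT_MASK	(IPTOS_TOS_MASK & ~3)	/* 0x1C */
#define RT_TOS(tos)	((tos) & IPTOS_TOS_MASK)

/*
 * Returns true if applying RT_TOS() before IPTOS_RT_MASK never changes
 * the result, i.e. RT_TOS() is redundant once IPTOS_RT_MASK is applied.
 */
bool rt_tos_is_redundant(void)
{
	for (unsigned int tos = 0; tos <= 0xFF; tos++) {
		if ((RT_TOS(tos) & IPTOS_RT_MASK) != (tos & IPTOS_RT_MASK))
			return false;
	}
	return true;
}
```

Every bit kept by IPTOS_RT_MASK is also kept by IPTOS_TOS_MASK, which
is why the callers can pass iph->tos directly.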


* [PATCH net-next 0/3] ipv4: First steps toward removing RTO_ONLINK
From: Guillaume Nault @ 2022-04-20 23:21 UTC
  To: David Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern, dccp

RTO_ONLINK is a flag that reduces the scope of route lookups.
It's stored in a normally unused bit of the ->flowi4_tos field, in
struct flowi4. However it has several problems:

 * This bit is also used by ECN. Although ECN bits are supposed to be
   cleared before doing a route lookup, it happened that some code
   paths didn't properly sanitise their ->flowi4_tos. So this mechanism
   is fragile and we had bugs in the past where ECN bits slipped in and
   could end up being erroneously interpreted as RTO_ONLINK.

 * A dscp_t type was recently introduced to ensure ECN bits are cleared
   during route lookups. ->flowi4_tos is the most important structure
   field to convert, but RTO_ONLINK prevents such conversion, as dscp_t
   mandates that ECN bits (where RTO_ONLINK is stored) be zero.

Therefore we need to stop using RTO_ONLINK altogether. Fortunately
RTO_ONLINK isn't a necessity. Instead of passing a flag in ->flowi4_tos
to tell the route lookup function to restrict the scope, we can simply
initialise the scope correctly.

Patch 1 does some preparatory work: it stops resetting ->flowi4_scope
automatically before a route lookup, thus allowing callers to set their
desired scope without having to rely on the RTO_ONLINK flag.

Patches 2-3 convert a few code paths to avoid relying on RTO_ONLINK.

More conversions will have to take place before we can eventually
remove this flag.

Guillaume Nault (3):
  ipv4: Don't reset ->flowi4_scope in ip_rt_fix_tos().
  ipv4: Avoid using RTO_ONLINK with ip_route_connect().
  ipv4: Initialise ->flowi4_scope properly in ICMP handlers.

 include/net/route.h | 36 ++++++++++++++++++++++++------------
 net/dccp/ipv4.c     |  5 ++---
 net/ipv4/af_inet.c  |  6 +++---
 net/ipv4/datagram.c |  7 +++----
 net/ipv4/route.c    | 41 +++++++++++++++++++----------------------
 net/ipv4/tcp_ipv4.c |  5 ++---
 6 files changed, 53 insertions(+), 47 deletions(-)

-- 
2.21.3
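
Editor's illustration (not part of the series): the overlap between
RTO_ONLINK and the ECN bits can be shown with a few lines of userspace
C. This is a sketch. The constant values match the kernel headers, but
tos_requests_onlink() and tos_clear_ecn() are hypothetical helpers
written for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RTO_ONLINK	0x01	/* include/net/route.h */
#define INET_ECN_ECT_1	0x01	/* ECN codepoints, include/net/inet_ecn.h */
#define INET_ECN_ECT_0	0x02
#define INET_ECN_CE	0x03
#define INET_ECN_MASK	0x03	/* two low bits of the TOS byte */

/* What a route lookup sees if it interprets bit 0 as RTO_ONLINK. */
bool tos_requests_onlink(uint8_t tos)
{
	return (tos & RTO_ONLINK) != 0;
}

/* Sanitise a TOS byte taken from a packet by clearing the ECN bits. */
uint8_t tos_clear_ecn(uint8_t tos)
{
	return (uint8_t)(tos & ~INET_ECN_MASK);
}
```

A TOS byte carrying ECT(1) or CE has bit 0 set, so an unsanitised value
looks like an RTO_ONLINK request. ECT(0) does not, which may be part of
why such bugs are easy to miss.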

* [PATCH net-next 2/3] ipv4: Avoid using RTO_ONLINK with ip_route_connect().
From: Guillaume Nault @ 2022-04-20 23:21 UTC
  To: David Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern, dccp
In-Reply-To: <cover.1650470610.git.gnault@redhat.com>

Now that ip_rt_fix_tos() doesn't reset ->flowi4_scope unconditionally,
we don't have to rely on the RTO_ONLINK bit to properly set the scope
of a flowi4 structure. We can just set ->flowi4_scope explicitly and
avoid using RTO_ONLINK in ->flowi4_tos.

This patch converts callers of ip_route_connect(). Instead of setting
the tos parameter with RT_CONN_FLAGS(sk), as all callers do, we can:

  1- Drop the tos parameter from ip_route_connect(): its value was
     entirely based on sk, which is also passed as parameter.

  2- Set ->flowi4_scope depending on the SOCK_LOCALROUTE socket option
     instead of always initialising it with RT_SCOPE_UNIVERSE (let's
     define ip_sock_rt_scope() for this purpose).

  3- Avoid overloading ->flowi4_tos with RTO_ONLINK: since the scope is
     now properly initialised, we don't need to tell ip_rt_fix_tos() to
     adjust ->flowi4_scope for us. So let's define ip_sock_rt_tos(),
     which is the same as RT_CONN_FLAGS() but without the RTO_ONLINK
     bit overload.

Note:
  In the original ip_route_connect() code, __ip_route_output_key()
  might clear the RTO_ONLINK bit of fl4->flowi4_tos (because of
  ip_rt_fix_tos()). Therefore flowi4_update_output() had to reuse the
  original tos variable. Now that we don't set RTO_ONLINK any more,
  this is not a problem and we can use fl4->flowi4_tos in
  flowi4_update_output().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
 include/net/route.h | 36 ++++++++++++++++++++++++------------
 net/dccp/ipv4.c     |  5 ++---
 net/ipv4/af_inet.c  |  6 +++---
 net/ipv4/datagram.c |  7 +++----
 net/ipv4/tcp_ipv4.c |  5 ++---
 5 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 25404fc2b483..991a3985712d 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -43,6 +43,19 @@
 #define RT_CONN_FLAGS(sk)   (RT_TOS(inet_sk(sk)->tos) | sock_flag(sk, SOCK_LOCALROUTE))
 #define RT_CONN_FLAGS_TOS(sk,tos)   (RT_TOS(tos) | sock_flag(sk, SOCK_LOCALROUTE))
 
+static inline __u8 ip_sock_rt_scope(const struct sock *sk)
+{
+	if (sock_flag(sk, SOCK_LOCALROUTE))
+		return RT_SCOPE_LINK;
+
+	return RT_SCOPE_UNIVERSE;
+}
+
+static inline __u8 ip_sock_rt_tos(const struct sock *sk)
+{
+	return RT_TOS(inet_sk(sk)->tos);
+}
+
 struct ip_tunnel_info;
 struct fib_nh;
 struct fib_info;
@@ -289,39 +302,38 @@ static inline char rt_tos2priority(u8 tos)
  * ip_route_newports() calls.
  */
 
-static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst, __be32 src,
-					 u32 tos, int oif, u8 protocol,
+static inline void ip_route_connect_init(struct flowi4 *fl4, __be32 dst,
+					 __be32 src, int oif, u8 protocol,
 					 __be16 sport, __be16 dport,
-					 struct sock *sk)
+					 const struct sock *sk)
 {
 	__u8 flow_flags = 0;
 
 	if (inet_sk(sk)->transparent)
 		flow_flags |= FLOWI_FLAG_ANYSRC;
 
-	flowi4_init_output(fl4, oif, sk->sk_mark, tos, RT_SCOPE_UNIVERSE,
-			   protocol, flow_flags, dst, src, dport, sport,
-			   sk->sk_uid);
+	flowi4_init_output(fl4, oif, sk->sk_mark, ip_sock_rt_tos(sk),
+			   ip_sock_rt_scope(sk), protocol, flow_flags, dst,
+			   src, dport, sport, sk->sk_uid);
 }
 
-static inline struct rtable *ip_route_connect(struct flowi4 *fl4,
-					      __be32 dst, __be32 src, u32 tos,
-					      int oif, u8 protocol,
+static inline struct rtable *ip_route_connect(struct flowi4 *fl4, __be32 dst,
+					      __be32 src, int oif, u8 protocol,
 					      __be16 sport, __be16 dport,
 					      struct sock *sk)
 {
 	struct net *net = sock_net(sk);
 	struct rtable *rt;
 
-	ip_route_connect_init(fl4, dst, src, tos, oif, protocol,
-			      sport, dport, sk);
+	ip_route_connect_init(fl4, dst, src, oif, protocol, sport, dport, sk);
 
 	if (!dst || !src) {
 		rt = __ip_route_output_key(net, fl4);
 		if (IS_ERR(rt))
 			return rt;
 		ip_rt_put(rt);
-		flowi4_update_output(fl4, oif, tos, fl4->daddr, fl4->saddr);
+		flowi4_update_output(fl4, oif, fl4->flowi4_tos, fl4->daddr,
+				     fl4->saddr);
 	}
 	security_sk_classify_flow(sk, flowi4_to_flowi_common(fl4));
 	return ip_route_output_flow(net, fl4, sk);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index ae662567a6cb..82696ab86f74 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -76,9 +76,8 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	orig_dport = usin->sin_port;
 	fl4 = &inet->cork.fl.u.ip4;
 	rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
-			      RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
-			      IPPROTO_DCCP,
-			      orig_sport, orig_dport, sk);
+			      sk->sk_bound_dev_if, IPPROTO_DCCP, orig_sport,
+			      orig_dport, sk);
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 195ecfa2f000..93da9f783bec 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1233,9 +1233,9 @@ static int inet_sk_reselect_saddr(struct sock *sk)
 
 	/* Query new route. */
 	fl4 = &inet->cork.fl.u.ip4;
-	rt = ip_route_connect(fl4, daddr, 0, RT_CONN_FLAGS(sk),
-			      sk->sk_bound_dev_if, sk->sk_protocol,
-			      inet->inet_sport, inet->inet_dport, sk);
+	rt = ip_route_connect(fl4, daddr, 0, sk->sk_bound_dev_if,
+			      sk->sk_protocol, inet->inet_sport,
+			      inet->inet_dport, sk);
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
 
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index 48f337ccf949..ffd57523331f 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -44,10 +44,9 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 			saddr = inet->mc_addr;
 	}
 	fl4 = &inet->cork.fl.u.ip4;
-	rt = ip_route_connect(fl4, usin->sin_addr.s_addr, saddr,
-			      RT_CONN_FLAGS(sk), oif,
-			      sk->sk_protocol,
-			      inet->inet_sport, usin->sin_port, sk);
+	rt = ip_route_connect(fl4, usin->sin_addr.s_addr, saddr, oif,
+			      sk->sk_protocol, inet->inet_sport,
+			      usin->sin_port, sk);
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		if (err == -ENETUNREACH)
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 157265aecbed..2c2d42142555 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -229,9 +229,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	orig_dport = usin->sin_port;
 	fl4 = &inet->cork.fl.u.ip4;
 	rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
-			      RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
-			      IPPROTO_TCP,
-			      orig_sport, orig_dport, sk);
+			      sk->sk_bound_dev_if, IPPROTO_TCP, orig_sport,
+			      orig_dport, sk);
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		if (err == -ENETUNREACH)
-- 
2.21.3
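
Editor's illustration (not part of the patch): the split of
RT_CONN_FLAGS() into a tos helper and a scope helper can be modelled in
userspace. This is a sketch. struct model_sock and the model_*()
functions are hypothetical stand-ins for struct sock, RT_CONN_FLAGS(),
ip_sock_rt_tos() and ip_sock_rt_scope(). The numeric constants are
copied from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IPTOS_TOS_MASK		0x1E
#define RTO_ONLINK		0x01
#define RT_SCOPE_UNIVERSE	0
#define RT_SCOPE_LINK		253

/* Hypothetical stand-in: only the two fields the helpers look at. */
struct model_sock {
	uint8_t tos;		/* inet_sk(sk)->tos */
	bool localroute;	/* sock_flag(sk, SOCK_LOCALROUTE) */
};

/* Old style: the scope request is smuggled into bit 0 of the tos. */
uint8_t model_rt_conn_flags(const struct model_sock *sk)
{
	return (uint8_t)((sk->tos & IPTOS_TOS_MASK) |
			 (sk->localroute ? RTO_ONLINK : 0));
}

/* New style: the tos carries only tos bits ... */
uint8_t model_ip_sock_rt_tos(const struct model_sock *sk)
{
	return (uint8_t)(sk->tos & IPTOS_TOS_MASK);
}

/* ... and the scope is returned separately. */
uint8_t model_ip_sock_rt_scope(const struct model_sock *sk)
{
	return sk->localroute ? RT_SCOPE_LINK : RT_SCOPE_UNIVERSE;
}
```

For a socket with tos 0x10 and SOCK_LOCALROUTE set, the old helper
returns 0x11, while the new pair returns 0x10 and RT_SCOPE_LINK.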


* [PATCH net-next 1/3] ipv4: Don't reset ->flowi4_scope in ip_rt_fix_tos().
From: Guillaume Nault @ 2022-04-20 23:21 UTC
  To: David Miller, Jakub Kicinski, Paolo Abeni
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern, dccp
In-Reply-To: <cover.1650470610.git.gnault@redhat.com>

All callers already initialise ->flowi4_scope with RT_SCOPE_UNIVERSE,
either by manual field assignment, memset(0) of the whole structure or
implicit structure initialisation of on-stack variables
(RT_SCOPE_UNIVERSE actually equals 0).

Therefore, we don't need to always initialise ->flowi4_scope in
ip_rt_fix_tos(). We only need to reduce the scope to RT_SCOPE_LINK when
the special RTO_ONLINK flag is present in the tos.

This will allow some code simplification, like removing
ip_rt_fix_tos(). Also, the long term idea is to remove RTO_ONLINK
entirely by properly initialising ->flowi4_scope, instead of
overloading ->flowi4_tos with a special flag. Eventually, this will
allow converting ->flowi4_tos to dscp_t.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
It's important for the correctness of this patch that all callers
initialise ->flowi4_scope to 0 (in one way or another). Auditing all of
them is long, although each case is pretty trivial.

If it helps, I can send a patch series that replaces the implicit
initialisation of ->flowi4_scope with an explicit assignment to
RT_SCOPE_UNIVERSE. This would also have the advantage of making it
clear to future readers that ->flowi4_scope _has_ to be initialised. I
haven't sent such a patch series, to avoid overwhelming reviewers with
trivial and not technically required changes (there are 40+ places to modify,
scattered over 30+ different files). But if anyone prefers explicit
initialisation everywhere, then just let me know and I'll send such
patches.
---
 net/ipv4/route.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e839d424b861..d8f82c0ac132 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -503,8 +503,8 @@ static void ip_rt_fix_tos(struct flowi4 *fl4)
 	__u8 tos = RT_FL_TOS(fl4);

 	fl4->flowi4_tos = tos & IPTOS_RT_MASK;
-	fl4->flowi4_scope = tos & RTO_ONLINK ?
-			    RT_SCOPE_LINK : RT_SCOPE_UNIVERSE;
+	if (tos & RTO_ONLINK)
+		fl4->flowi4_scope = RT_SCOPE_LINK;
 }

 static void __build_flow_key(const struct net *net, struct flowi4 *fl4,
-- 
2.21.3


* RE: [PATCH 5/5] Drivers: hv: vmbus: Refactor the ring-buffer iterator functions
From: Michael Kelley (LINUX) @ 2022-04-20 23:15 UTC (permalink / raw)
  To: Andrea Parri (Microsoft), KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Dexuan Cui, Stefano Garzarella,
	David Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-hyperv@vger.kernel.org,
	virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20220420200720.434717-6-parri.andrea@gmail.com>

From: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Sent: Wednesday, April 20, 2022 1:07 PM
> 
> With no users of hv_pkt_iter_next_raw() and no "external" users of
> hv_pkt_iter_first_raw(), the iterator functions can be refactored
> and simplified to remove some indirection/code.
> 
> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
> ---
>  drivers/hv/ring_buffer.c | 32 +++++++++-----------------------
>  include/linux/hyperv.h   | 35 ++++-------------------------------
>  2 files changed, 13 insertions(+), 54 deletions(-)
> 
> diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
> index 3d215d9dec433..fa98b3a91206a 100644
> --- a/drivers/hv/ring_buffer.c
> +++ b/drivers/hv/ring_buffer.c
> @@ -421,7 +421,7 @@ int hv_ringbuffer_read(struct vmbus_channel *channel,
>  	memcpy(buffer, (const char *)desc + offset, packetlen);
> 
>  	/* Advance ring index to next packet descriptor */
> -	__hv_pkt_iter_next(channel, desc, true);
> +	__hv_pkt_iter_next(channel, desc);
> 
>  	/* Notify host of update */
>  	hv_pkt_iter_close(channel);
> @@ -456,22 +456,6 @@ static u32 hv_pkt_iter_avail(const struct hv_ring_buffer_info *rbi)
>  		return (rbi->ring_datasize - priv_read_loc) + write_loc;
>  }
> 
> -/*
> - * Get first vmbus packet without copying it out of the ring buffer
> - */
> -struct vmpacket_descriptor *hv_pkt_iter_first_raw(struct vmbus_channel *channel)
> -{
> -	struct hv_ring_buffer_info *rbi = &channel->inbound;
> -
> -	hv_debug_delay_test(channel, MESSAGE_DELAY);
> -
> -	if (hv_pkt_iter_avail(rbi) < sizeof(struct vmpacket_descriptor))
> -		return NULL;
> -
> -	return (struct vmpacket_descriptor *)(hv_get_ring_buffer(rbi) + rbi->priv_read_index);
> -}
> -EXPORT_SYMBOL_GPL(hv_pkt_iter_first_raw);
> -
>  /*
>   * Get first vmbus packet from ring buffer after read_index
>   *
> @@ -483,11 +467,14 @@ struct vmpacket_descriptor *hv_pkt_iter_first(struct vmbus_channel *channel)
>  	struct vmpacket_descriptor *desc, *desc_copy;
>  	u32 bytes_avail, pkt_len, pkt_offset;
> 
> -	desc = hv_pkt_iter_first_raw(channel);
> -	if (!desc)
> +	hv_debug_delay_test(channel, MESSAGE_DELAY);
> +
> +	bytes_avail = hv_pkt_iter_avail(rbi);
> +	if (bytes_avail < sizeof(struct vmpacket_descriptor))
>  		return NULL;
> +	bytes_avail = min(rbi->pkt_buffer_size, bytes_avail);
> 
> -	bytes_avail = min(rbi->pkt_buffer_size, hv_pkt_iter_avail(rbi));
> +	desc = (struct vmpacket_descriptor *)(hv_get_ring_buffer(rbi) + rbi->priv_read_index);
> 
>  	/*
>  	 * Ensure the compiler does not use references to incoming Hyper-V values (which
> @@ -534,8 +521,7 @@ EXPORT_SYMBOL_GPL(hv_pkt_iter_first);
>   */
>  struct vmpacket_descriptor *
>  __hv_pkt_iter_next(struct vmbus_channel *channel,
> -		   const struct vmpacket_descriptor *desc,
> -		   bool copy)
> +		   const struct vmpacket_descriptor *desc)
>  {
>  	struct hv_ring_buffer_info *rbi = &channel->inbound;
>  	u32 packetlen = desc->len8 << 3;
> @@ -548,7 +534,7 @@ __hv_pkt_iter_next(struct vmbus_channel *channel,
>  		rbi->priv_read_index -= dsize;
> 
>  	/* more data? */
> -	return copy ? hv_pkt_iter_first(channel) : hv_pkt_iter_first_raw(channel);
> +	return hv_pkt_iter_first(channel);
>  }
>  EXPORT_SYMBOL_GPL(__hv_pkt_iter_next);
> 
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index 1112c5cf894e6..370adc9971d3e 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1673,55 +1673,28 @@ static inline u32 hv_pkt_len(const struct vmpacket_descriptor *desc)
>  	return desc->len8 << 3;
>  }
> 
> -struct vmpacket_descriptor *
> -hv_pkt_iter_first_raw(struct vmbus_channel *channel);
> -
>  struct vmpacket_descriptor *
>  hv_pkt_iter_first(struct vmbus_channel *channel);
> 
>  struct vmpacket_descriptor *
>  __hv_pkt_iter_next(struct vmbus_channel *channel,
> -		   const struct vmpacket_descriptor *pkt,
> -		   bool copy);
> +		   const struct vmpacket_descriptor *pkt);
> 
>  void hv_pkt_iter_close(struct vmbus_channel *channel);
> 
>  static inline struct vmpacket_descriptor *
> -hv_pkt_iter_next_pkt(struct vmbus_channel *channel,
> -		     const struct vmpacket_descriptor *pkt,
> -		     bool copy)
> +hv_pkt_iter_next(struct vmbus_channel *channel,
> +		 const struct vmpacket_descriptor *pkt)
>  {
>  	struct vmpacket_descriptor *nxt;
> 
> -	nxt = __hv_pkt_iter_next(channel, pkt, copy);
> +	nxt = __hv_pkt_iter_next(channel, pkt);
>  	if (!nxt)
>  		hv_pkt_iter_close(channel);
> 
>  	return nxt;
>  }
> 
> -/*
> - * Get next packet descriptor without copying it out of the ring buffer
> - * If at end of list, return NULL and update host.
> - */
> -static inline struct vmpacket_descriptor *
> -hv_pkt_iter_next_raw(struct vmbus_channel *channel,
> -		     const struct vmpacket_descriptor *pkt)
> -{
> -	return hv_pkt_iter_next_pkt(channel, pkt, false);
> -}
> -
> -/*
> - * Get next packet descriptor from iterator
> - * If at end of list, return NULL and update host.
> - */
> -static inline struct vmpacket_descriptor *
> -hv_pkt_iter_next(struct vmbus_channel *channel,
> -		 const struct vmpacket_descriptor *pkt)
> -{
> -	return hv_pkt_iter_next_pkt(channel, pkt, true);
> -}
> -
>  #define foreach_vmbus_pkt(pkt, channel) \
>  	for (pkt = hv_pkt_iter_first(channel); pkt; \
>  	    pkt = hv_pkt_iter_next(channel, pkt))
> --
> 2.25.1

Reviewed-by: Michael Kelley <mikelley@microsoft.com>



* RE: [PATCH 4/5] Drivers: hv: vmbus: Accept hv_sock offers in isolated guests
From: Michael Kelley (LINUX) @ 2022-04-20 23:10 UTC (permalink / raw)
  To: Andrea Parri (Microsoft), KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Dexuan Cui, Stefano Garzarella,
	David Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-hyperv@vger.kernel.org,
	virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20220420200720.434717-5-parri.andrea@gmail.com>

From: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Sent: Wednesday, April 20, 2022 1:07 PM
> 
> So that isolated guests can communicate with the host via hv_sock
> channels.
> 
> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
> ---
>  drivers/hv/channel_mgmt.c | 8 ++++++--
>  include/linux/hyperv.h    | 8 ++++++--
>  2 files changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 67be81208a2d9..d800220ee54f4 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -976,13 +976,17 @@ find_primary_channel_by_offer(const struct vmbus_channel_offer_channel *offer)
>  	return channel;
>  }
> 
> -static bool vmbus_is_valid_device(const guid_t *guid)
> +static bool vmbus_is_valid_offer(const struct vmbus_channel_offer_channel *offer)
>  {
> +	const guid_t *guid = &offer->offer.if_type;
>  	u16 i;
> 
>  	if (!hv_is_isolation_supported())
>  		return true;
> 
> +	if (is_hvsock_offer(offer))
> +		return true;
> +
>  	for (i = 0; i < ARRAY_SIZE(vmbus_devs); i++) {
>  		if (guid_equal(guid, &vmbus_devs[i].guid))
>  			return vmbus_devs[i].allowed_in_isolated;
> @@ -1004,7 +1008,7 @@ static void vmbus_onoffer(struct vmbus_channel_message_header *hdr)
> 
>  	trace_vmbus_onoffer(offer);
> 
> -	if (!vmbus_is_valid_device(&offer->offer.if_type)) {
> +	if (!vmbus_is_valid_offer(offer)) {
>  		pr_err_ratelimited("Invalid offer %d from the host supporting isolation\n",
>  				   offer->child_relid);
>  		atomic_dec(&vmbus_connection.offer_in_progress);
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index 55478a6810b60..1112c5cf894e6 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1044,10 +1044,14 @@ struct vmbus_channel {
>  u64 vmbus_next_request_id(struct vmbus_channel *channel, u64 rqst_addr);
>  u64 vmbus_request_addr(struct vmbus_channel *channel, u64 trans_id);
> 
> +static inline bool is_hvsock_offer(const struct vmbus_channel_offer_channel *o)
> +{
> +	return !!(o->offer.chn_flags & VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER);
> +}
> +
>  static inline bool is_hvsock_channel(const struct vmbus_channel *c)
>  {
> -	return !!(c->offermsg.offer.chn_flags &
> -		  VMBUS_CHANNEL_TLNPI_PROVIDER_OFFER);
> +	return is_hvsock_offer(&c->offermsg);
>  }
> 
>  static inline bool is_sub_channel(const struct vmbus_channel *c)
> --
> 2.25.1

Reviewed-by: Michael Kelley <mikelley@microsoft.com>



* RE: [PATCH 3/5] hv_sock: Add validation for untrusted Hyper-V values
From: Michael Kelley (LINUX) @ 2022-04-20 23:09 UTC (permalink / raw)
  To: Andrea Parri (Microsoft), KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Dexuan Cui, Stefano Garzarella,
	David Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-hyperv@vger.kernel.org,
	virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20220420200720.434717-4-parri.andrea@gmail.com>

From: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Sent: Wednesday, April 20, 2022 1:07 PM
> 
> For additional robustness in the face of Hyper-V errors or malicious
> behavior, validate all values that originate from packets that Hyper-V
> has sent to the guest in the host-to-guest ring buffer.  Ensure that
> invalid values cannot cause data to be copied outside the bounds of the
> source buffer in hvs_stream_dequeue().
> 
> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
> ---
>  include/linux/hyperv.h           |  5 +++++
>  net/vmw_vsock/hyperv_transport.c | 11 +++++++++--
>  2 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index fe2e0179ed51e..55478a6810b60 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1663,6 +1663,11 @@ static inline u32 hv_pkt_datalen(const struct vmpacket_descriptor *desc)
>  	return (desc->len8 << 3) - (desc->offset8 << 3);
>  }
> 
> +/* Get packet length associated with descriptor */
> +static inline u32 hv_pkt_len(const struct vmpacket_descriptor *desc)
> +{
> +	return desc->len8 << 3;
> +}
> 
>  struct vmpacket_descriptor *
>  hv_pkt_iter_first_raw(struct vmbus_channel *channel);
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index 8c37d07017fc4..092cadc2c866d 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -577,12 +577,19 @@ static bool hvs_dgram_allow(u32 cid, u32 port)
>  static int hvs_update_recv_data(struct hvsock *hvs)
>  {
>  	struct hvs_recv_buf *recv_buf;
> -	u32 payload_len;
> +	u32 pkt_len, payload_len;
> +
> +	pkt_len = hv_pkt_len(hvs->recv_desc);
> +
> +	/* Ensure the packet is big enough to read its header */
> +	if (pkt_len < HVS_HEADER_LEN)
> +		return -EIO;
> 
>  	recv_buf = (struct hvs_recv_buf *)(hvs->recv_desc + 1);
>  	payload_len = recv_buf->hdr.data_size;
> 
> -	if (payload_len > HVS_MTU_SIZE)
> +	/* Ensure the packet is big enough to read its payload */
> +	if (payload_len > pkt_len - HVS_HEADER_LEN || payload_len > HVS_MTU_SIZE)
>  		return -EIO;
> 
>  	if (payload_len == 0)
> --
> 2.25.1

Reviewed-by: Michael Kelley <mikelley@microsoft.com>
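
For readers following the two checks in hvs_update_recv_data(), the
ordering can be sketched in plain C as below. This is only an
illustration: HDR_LEN_SKETCH and MTU_SKETCH are hypothetical placeholder
values, not the real HVS_HEADER_LEN and HVS_MTU_SIZE, and
check_payload_sketch() does not exist in the kernel. The header-size test
has to come first so that the unsigned subtraction in the second test
cannot wrap around.

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder sizes for this sketch only. */
#define HDR_LEN_SKETCH 40u
#define MTU_SKETCH     16384u

/* Return 0 if payload_len fits inside a packet of pkt_len bytes, else -EIO. */
static int check_payload_sketch(uint32_t pkt_len, uint32_t payload_len)
{
	/* The packet must at least hold its own header. */
	if (pkt_len < HDR_LEN_SKETCH)
		return -EIO;

	/* pkt_len >= HDR_LEN_SKETCH here, so the subtraction cannot wrap. */
	if (payload_len > pkt_len - HDR_LEN_SKETCH || payload_len > MTU_SKETCH)
		return -EIO;

	return 0;
}
```

If the first test were dropped, a short packet would make
pkt_len - HDR_LEN_SKETCH a huge unsigned value and nearly any payload
length would pass the second test.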



* RE: [PATCH 2/5] hv_sock: Copy packets sent by Hyper-V out of the ring buffer
From: Michael Kelley (LINUX) @ 2022-04-20 23:08 UTC (permalink / raw)
  To: Andrea Parri (Microsoft), KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Dexuan Cui, Stefano Garzarella,
	David Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-hyperv@vger.kernel.org,
	virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20220420200720.434717-3-parri.andrea@gmail.com>

From: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Sent: Wednesday, April 20, 2022 1:07 PM
> 
> Pointers to VMbus packets sent by Hyper-V are used by the hv_sock driver
> within the guest VM.  Hyper-V can send packets with erroneous values or
> modify packet fields after they are processed by the guest.  To defend
> against these scenarios, copy the incoming packet after validating its
> length and offset fields using hv_pkt_iter_{first,next}().  In this way,
> the packet can no longer be modified by the host.
> 
> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
> ---
>  net/vmw_vsock/hyperv_transport.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index 943352530936e..8c37d07017fc4 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -78,6 +78,9 @@ struct hvs_send_buf {
>  					 ALIGN((payload_len), 8) + \
>  					 VMBUS_PKT_TRAILER_SIZE)
> 
> +/* Upper bound on the size of a VMbus packet for hv_sock */
> +#define HVS_MAX_PKT_SIZE	HVS_PKT_LEN(HVS_MTU_SIZE)
> +
>  union hvs_service_id {
>  	guid_t	srv_id;
> 
> @@ -378,6 +381,8 @@ static void hvs_open_connection(struct vmbus_channel *chan)
>  		rcvbuf = ALIGN(rcvbuf, HV_HYP_PAGE_SIZE);
>  	}
> 
> +	chan->max_pkt_size = HVS_MAX_PKT_SIZE;
> +
>  	ret = vmbus_open(chan, sndbuf, rcvbuf, NULL, 0, hvs_channel_cb,
>  			 conn_from_host ? new : sk);
>  	if (ret != 0) {
> @@ -602,7 +607,7 @@ static ssize_t hvs_stream_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
>  		return -EOPNOTSUPP;
> 
>  	if (need_refill) {
> -		hvs->recv_desc = hv_pkt_iter_first_raw(hvs->chan);
> +		hvs->recv_desc = hv_pkt_iter_first(hvs->chan);
>  		if (!hvs->recv_desc)
>  			return -ENOBUFS;
>  		ret = hvs_update_recv_data(hvs);
> @@ -618,7 +623,7 @@ static ssize_t hvs_stream_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
> 
>  	hvs->recv_data_len -= to_read;
>  	if (hvs->recv_data_len == 0) {
> -		hvs->recv_desc = hv_pkt_iter_next_raw(hvs->chan, hvs->recv_desc);
> +		hvs->recv_desc = hv_pkt_iter_next(hvs->chan, hvs->recv_desc);
>  		if (hvs->recv_desc) {
>  			ret = hvs_update_recv_data(hvs);
>  			if (ret)
> --
> 2.25.1

Reviewed-by: Michael Kelley <mikelley@microsoft.com>
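
The copy-then-use idea in the commit message can be sketched in user
space as follows. This is a generic illustration of the pattern, not what
hv_pkt_iter_first() actually does: struct desc_sketch and
copy_pkt_sketch() are hypothetical, ring wrap-around is ignored, and the
sketch rejects bad lengths outright, which may differ from how the kernel
handles them.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, reduced descriptor; the real one has more fields. */
struct desc_sketch {
	uint16_t offset8;	/* payload offset, in 8-byte units */
	uint16_t len8;		/* total packet length, in 8-byte units */
};

/*
 * Copy one packet from memory the peer can still write ("shared") into
 * private memory. Returns the packet length, or -1 if it fails validation.
 */
static long copy_pkt_sketch(const void *shared, size_t avail,
			    void *priv, size_t priv_size)
{
	struct desc_sketch hdr;
	size_t pkt_len, pkt_off;

	if (avail < sizeof(hdr))
		return -1;

	/* Snapshot the header once; every later check uses this copy. */
	memcpy(&hdr, shared, sizeof(hdr));
	pkt_len = (size_t)hdr.len8 << 3;
	pkt_off = (size_t)hdr.offset8 << 3;

	if (pkt_off < sizeof(hdr) || pkt_off > pkt_len ||
	    pkt_len > avail || pkt_len > priv_size)
		return -1;

	memcpy(priv, shared, pkt_len);
	/* Make sure the private copy carries the validated header. */
	memcpy(priv, &hdr, sizeof(hdr));
	return (long)pkt_len;
}
```

After the copy, changes the peer makes to the shared region no longer
affect the fields the consumer reads.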



* RE: [PATCH 1/5] hv_sock: Check hv_pkt_iter_first_raw()'s return value
From: Michael Kelley (LINUX) @ 2022-04-20 23:07 UTC (permalink / raw)
  To: Andrea Parri (Microsoft), KY Srinivasan, Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Dexuan Cui, Stefano Garzarella,
	David Miller, Jakub Kicinski, Paolo Abeni
  Cc: linux-hyperv@vger.kernel.org,
	virtualization@lists.linux-foundation.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20220420200720.434717-2-parri.andrea@gmail.com>

From: Andrea Parri (Microsoft) <parri.andrea@gmail.com> Sent: Wednesday, April 20, 2022 1:07 PM
> 
> The function returns NULL if the ring buffer doesn't contain enough
> readable bytes to constitute a packet descriptor.  The ring buffer's
> write_index is in memory that is shared with the Hyper-V host; an
> erroneous or malicious host could thus change its value and overturn
> the result of hvs_stream_has_data().
> 
> Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
> ---
>  net/vmw_vsock/hyperv_transport.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
> index e111e13b66604..943352530936e 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -603,6 +603,8 @@ static ssize_t hvs_stream_dequeue(struct vsock_sock *vsk, struct msghdr *msg,
> 
>  	if (need_refill) {
>  		hvs->recv_desc = hv_pkt_iter_first_raw(hvs->chan);
> +		if (!hvs->recv_desc)
> +			return -ENOBUFS;
>  		ret = hvs_update_recv_data(hvs);
>  		if (ret)
>  			return ret;
> --
> 2.25.1

Reviewed-by: Michael Kelley <mikelley@microsoft.com>
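
As a rough illustration of why the NULL check is needed, the sketch
below computes the readable bytes in a ring from a read and a write
index and refuses to hand out a descriptor when too few bytes are
available. All names are hypothetical and DESC_SIZE_SKETCH is an assumed
descriptor size; this is not the vmbus implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define DESC_SIZE_SKETCH 16u	/* assumed descriptor size for this sketch */

/* Bytes readable between read_loc and write_loc in a ring of datasize bytes. */
static uint32_t ring_avail_sketch(uint32_t read_loc, uint32_t write_loc,
				  uint32_t datasize)
{
	if (write_loc >= read_loc)
		return write_loc - read_loc;
	/* The writer has wrapped around the end of the ring. */
	return (datasize - read_loc) + write_loc;
}

/* Return the first descriptor, or NULL if the ring holds less than one. */
static const uint8_t *first_desc_sketch(const uint8_t *ring, uint32_t read_loc,
					uint32_t write_loc, uint32_t datasize)
{
	if (ring_avail_sketch(read_loc, write_loc, datasize) < DESC_SIZE_SKETCH)
		return NULL;
	return ring + read_loc;
}
```

Because the write index lives in shared memory, a caller that saw enough
data a moment ago can still get NULL here, so the return value has to be
checked on every call.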



* Re: [PATCH bpf-next] bpf: use bpf_prog_run_array_cg_flags everywhere
From: Stanislav Fomichev @ 2022-04-20 22:30 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Networking, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko
In-Reply-To: <CAEf4BzYoA4xvqv7SaM2TvcbKef=m4n6TSGVNA34T2we05fRwpw@mail.gmail.com>

On Wed, Apr 20, 2022 at 3:04 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Apr 19, 2022 at 3:23 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and
> > use it everywhere. check_return_code already enforces sane
> > return ranges for all cgroup types. (Only egress and bind hooks have
> > non-canonical return ranges; the rest use [0, 1].)
> >
> > No functional changes.
> >
> > Suggested-by: Alexei Starovoitov <ast@kernel.org>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/linux/bpf-cgroup.h |  8 ++---
> >  kernel/bpf/cgroup.c        | 70 ++++++++++++--------------------------
> >  2 files changed, 24 insertions(+), 54 deletions(-)
> >
> > diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
> > index 88a51b242adc..669d96d074ad 100644
> > --- a/include/linux/bpf-cgroup.h
> > +++ b/include/linux/bpf-cgroup.h
> > @@ -225,24 +225,20 @@ static inline bool cgroup_bpf_sock_enabled(struct sock *sk,
> >
> >  #define BPF_CGROUP_RUN_SA_PROG(sk, uaddr, atype)                                      \
> >  ({                                                                            \
> > -       u32 __unused_flags;                                                    \
> >         int __ret = 0;                                                         \
> >         if (cgroup_bpf_enabled(atype))                                         \
> >                 __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, atype,     \
> > -                                                         NULL,                \
> > -                                                         &__unused_flags);    \
> > +                                                         NULL, NULL);         \
> >         __ret;                                                                 \
> >  })
> >
> >  #define BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, atype, t_ctx)                  \
> >  ({                                                                            \
> > -       u32 __unused_flags;                                                    \
> >         int __ret = 0;                                                         \
> >         if (cgroup_bpf_enabled(atype))  {                                      \
> >                 lock_sock(sk);                                                 \
> >                 __ret = __cgroup_bpf_run_filter_sock_addr(sk, uaddr, atype,     \
> > -                                                         t_ctx,               \
> > -                                                         &__unused_flags);    \
> > +                                                         t_ctx, NULL);        \
> >                 release_sock(sk);                                              \
> >         }                                                                      \
> >         __ret;                                                                 \
> > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> > index 0cb6211fcb58..f61eca32c747 100644
> > --- a/kernel/bpf/cgroup.c
> > +++ b/kernel/bpf/cgroup.c
> > @@ -25,50 +25,18 @@ EXPORT_SYMBOL(cgroup_bpf_enabled_key);
> >  /* __always_inline is necessary to prevent indirect call through run_prog
> >   * function pointer.
> >   */
> > -static __always_inline int
> > -bpf_prog_run_array_cg_flags(const struct cgroup_bpf *cgrp,
> > -                           enum cgroup_bpf_attach_type atype,
> > -                           const void *ctx, bpf_prog_run_fn run_prog,
> > -                           int retval, u32 *ret_flags)
> > -{
> > -       const struct bpf_prog_array_item *item;
> > -       const struct bpf_prog *prog;
> > -       const struct bpf_prog_array *array;
> > -       struct bpf_run_ctx *old_run_ctx;
> > -       struct bpf_cg_run_ctx run_ctx;
> > -       u32 func_ret;
> > -
> > -       run_ctx.retval = retval;
> > -       migrate_disable();
> > -       rcu_read_lock();
> > -       array = rcu_dereference(cgrp->effective[atype]);
> > -       item = &array->items[0];
> > -       old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
> > -       while ((prog = READ_ONCE(item->prog))) {
> > -               run_ctx.prog_item = item;
> > -               func_ret = run_prog(prog, ctx);
> > -               if (!(func_ret & 1) && !IS_ERR_VALUE((long)run_ctx.retval))
> > -                       run_ctx.retval = -EPERM;
> > -               *(ret_flags) |= (func_ret >> 1);
> > -               item++;
> > -       }
> > -       bpf_reset_run_ctx(old_run_ctx);
> > -       rcu_read_unlock();
> > -       migrate_enable();
> > -       return run_ctx.retval;
> > -}
> > -
> >  static __always_inline int
> >  bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
> >                       enum cgroup_bpf_attach_type atype,
> >                       const void *ctx, bpf_prog_run_fn run_prog,
> > -                     int retval)
> > +                     int retval, u32 *ret_flags)
> >  {
> >         const struct bpf_prog_array_item *item;
> >         const struct bpf_prog *prog;
> >         const struct bpf_prog_array *array;
> >         struct bpf_run_ctx *old_run_ctx;
> >         struct bpf_cg_run_ctx run_ctx;
> > +       u32 func_ret;
> >
> >         run_ctx.retval = retval;
> >         migrate_disable();
> > @@ -78,8 +46,11 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
> >         old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
> >         while ((prog = READ_ONCE(item->prog))) {
> >                 run_ctx.prog_item = item;
> > -               if (!run_prog(prog, ctx) && !IS_ERR_VALUE((long)run_ctx.retval))
> > +               func_ret = run_prog(prog, ctx);
> > +               if (!(func_ret & 1) && !IS_ERR_VALUE((long)run_ctx.retval))
>
> to be completely true to previous behavior, shouldn't there be
>
> if (ret_flags)
>     func_ret &= 1;
> if (!func_ret && !IS_ERR_VALUE(...))
>
> here?
>
> This might have been discussed previously and I missed it. If that's
> so, please ignore.

We are converting the cases where run_prog(prog, ctx) returns 0 or 1,
so it seems like we don't have to reproduce the existing behavior
1-to-1?
So I'm not sure it matters, or am I missing something?
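
To make the question concrete, here is a small stand-alone sketch of the
two tests being compared. The helper names are made up for illustration
and are not in the patch; the sketch only shows that the tests agree for
return values 0 and 1 and differ once upper bits are set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Test used by the old bpf_prog_run_array_cg(): only 0 denies. */
static bool denies_old(uint32_t func_ret)
{
	return !func_ret;
}

/* Test used after the merge: bit 0 decides, upper bits are flags. */
static bool denies_new(uint32_t func_ret)
{
	return !(func_ret & 1);
}

/* Flags carried in the upper bits of the return value. */
static uint32_t flags_of(uint32_t func_ret)
{
	return func_ret >> 1;
}
```

For a program that returns 2, denies_old() is false while denies_new()
is true, so the two only behave the same if the return range really is
limited to [0, 1] for the converted hooks.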


* [PATCH bpf-next 7/7] selftests: bpf: verify first of struct mptcp_sock
From: Mat Martineau @ 2022-04-20 22:24 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Geliang Tang, ast, daniel, andrii, mptcp, Matthieu Baerts,
	Mat Martineau
In-Reply-To: <20220420222459.307649-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch verifies the 'first' struct member of struct mptcp_sock, which
points to the first subflow of msk. Save 'sk' in mptcp_storage, and verify
it with 'first' in verify_msk().

Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 tools/testing/selftests/bpf/bpf_mptcp_helpers.h | 1 +
 tools/testing/selftests/bpf/prog_tests/mptcp.c  | 8 ++++++++
 tools/testing/selftests/bpf/progs/mptcp_sock.c  | 5 +++++
 3 files changed, 14 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_mptcp_helpers.h b/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
index 463e4e061c96..b5a43b108982 100644
--- a/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
@@ -10,6 +10,7 @@ struct mptcp_sock {
 	struct inet_connection_sock	sk;
 
 	__u32		token;
+	struct sock	*first;
 	char		ca_name[TCP_CA_NAME_MAX];
 } __attribute__((preserve_access_index));
 
diff --git a/tools/testing/selftests/bpf/prog_tests/mptcp.c b/tools/testing/selftests/bpf/prog_tests/mptcp.c
index 4518aa6e661e..7e704f5aab05 100644
--- a/tools/testing/selftests/bpf/prog_tests/mptcp.c
+++ b/tools/testing/selftests/bpf/prog_tests/mptcp.c
@@ -12,7 +12,9 @@
 struct mptcp_storage {
 	__u32 invoked;
 	__u32 is_mptcp;
+	struct sock *sk;
 	__u32 token;
+	struct sock *first;
 	char ca_name[TCP_CA_NAME_MAX];
 };
 
@@ -133,6 +135,12 @@ static int verify_msk(int map_fd, int client_fd)
 		err++;
 	}
 
+	if (val.first != val.sk) {
+		log_err("Unexpected mptcp_sock.first %p != %p",
+			val.first, val.sk);
+		err++;
+	}
+
 	if (strncmp(val.ca_name, ca_name, TCP_CA_NAME_MAX)) {
 		log_err("Unexpected mptcp_sock.ca_name %s != %s",
 			val.ca_name, ca_name);
diff --git a/tools/testing/selftests/bpf/progs/mptcp_sock.c b/tools/testing/selftests/bpf/progs/mptcp_sock.c
index 53bf67580010..1d1ac7cda2a2 100644
--- a/tools/testing/selftests/bpf/progs/mptcp_sock.c
+++ b/tools/testing/selftests/bpf/progs/mptcp_sock.c
@@ -12,7 +12,9 @@ __u32 _version SEC("version") = 1;
 struct mptcp_storage {
 	__u32 invoked;
 	__u32 is_mptcp;
+	struct sock *sk;
 	__u32 token;
+	struct sock *first;
 	char ca_name[TCP_CA_NAME_MAX];
 };
 
@@ -51,6 +53,7 @@ int _sockops(struct bpf_sock_ops *ctx)
 
 		storage->token = 0;
 		bzero(storage->ca_name, TCP_CA_NAME_MAX);
+		storage->first = NULL;
 	} else {
 		msk = bpf_skc_to_mptcp_sock(sk);
 		if (!msk)
@@ -63,9 +66,11 @@ int _sockops(struct bpf_sock_ops *ctx)
 
 		storage->token = msk->token;
 		memcpy(storage->ca_name, msk->ca_name, TCP_CA_NAME_MAX);
+		storage->first = msk->first;
 	}
 	storage->invoked++;
 	storage->is_mptcp = tcp_sk->is_mptcp;
+	storage->sk = (struct sock *)sk;
 
 	return 1;
 }
-- 
2.36.0



* [PATCH bpf-next 5/7] selftests: bpf: verify token of struct mptcp_sock
From: Mat Martineau @ 2022-04-20 22:24 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Geliang Tang, ast, daniel, andrii, mptcp, Matthieu Baerts,
	Mat Martineau
In-Reply-To: <20220420222459.307649-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch verifies the struct member token of struct mptcp_sock. Add a
new function get_msk_token() to parse the msk token from the output of
the command 'ip mptcp monitor', and verify it in verify_msk().
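
As a stand-alone illustration of the parsing step described above, the
sketch below extracts the token from one monitor line.
parse_token_sketch() is a hypothetical helper, not the get_msk_token()
added by this patch, and it uses strtoul() rather than strtol() on the
assumption that tokens may have the top bit set, which would exceed
LONG_MAX where long is 32 bits.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Return the token from a "CREATED" monitor line, or 0 if it does not match. */
static uint32_t parse_token_sketch(const char *line)
{
	static const char prefix[] = "[       CREATED] token=";
	size_t plen = sizeof(prefix) - 1;

	if (strncmp(line, prefix, plen) != 0)
		return 0;

	/* Parsing stops at the first non-hex character (the space). */
	return (uint32_t)strtoul(line + plen, NULL, 16);
}
```

Like the test itself, the sketch treats 0 as "no token found", so a real
token of 0 could not be told apart from a parse failure.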

Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 .../testing/selftests/bpf/bpf_mptcp_helpers.h |  1 +
 .../testing/selftests/bpf/prog_tests/mptcp.c  | 66 +++++++++++++++++++
 .../testing/selftests/bpf/progs/mptcp_sock.c  |  5 ++
 3 files changed, 72 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_mptcp_helpers.h b/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
index 18da4cc65e89..87e15810997d 100644
--- a/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
@@ -9,6 +9,7 @@
 struct mptcp_sock {
 	struct inet_connection_sock	sk;
 
+	__u32		token;
 } __attribute__((preserve_access_index));
 
 #endif
diff --git a/tools/testing/selftests/bpf/prog_tests/mptcp.c b/tools/testing/selftests/bpf/prog_tests/mptcp.c
index 4b40bbdaf91f..c5d96ba81e04 100644
--- a/tools/testing/selftests/bpf/prog_tests/mptcp.c
+++ b/tools/testing/selftests/bpf/prog_tests/mptcp.c
@@ -8,8 +8,11 @@
 struct mptcp_storage {
 	__u32 invoked;
 	__u32 is_mptcp;
+	__u32 token;
 };
 
+static char monitor_log_path[64];
+
 static int verify_tsk(int map_fd, int client_fd)
 {
 	char *msg = "plain TCP socket";
@@ -36,11 +39,58 @@ static int verify_tsk(int map_fd, int client_fd)
 	return err;
 }
 
+/*
+ * Parse the token from the output of 'ip mptcp monitor':
+ *
+ * [       CREATED] token=3ca933d3 remid=0 locid=0 saddr4=127.0.0.1 ...
+ * [       CREATED] token=2ab57040 remid=0 locid=0 saddr4=127.0.0.1 ...
+ */
+static __u32 get_msk_token(void)
+{
+	char *prefix = "[       CREATED] token=";
+	char buf[BUFSIZ] = {};
+	__u32 token = 0;
+	ssize_t len;
+	int fd;
+
+	sync();
+
+	fd = open(monitor_log_path, O_RDONLY);
+	if (CHECK_FAIL(fd < 0)) {
+		log_err("Failed to open %s", monitor_log_path);
+		return token;
+	}
+
+	len = read(fd, buf, sizeof(buf));
+	if (CHECK_FAIL(len < 0)) {
+		log_err("Failed to read %s", monitor_log_path);
+		goto err;
+	}
+
+	if (strncmp(buf, prefix, strlen(prefix))) {
+		log_err("Invalid prefix %s", buf);
+		goto err;
+	}
+
+	token = strtol(buf + strlen(prefix), NULL, 16);
+
+err:
+	close(fd);
+	return token;
+}
+
 static int verify_msk(int map_fd, int client_fd)
 {
 	char *msg = "MPTCP subflow socket";
 	int err = 0, cfd = client_fd;
 	struct mptcp_storage val;
+	__u32 token;
+
+	token = get_msk_token();
+	if (token <= 0) {
+		log_err("Unexpected token %x", token);
+		return -1;
+	}
 
 	if (CHECK_FAIL(bpf_map_lookup_elem(map_fd, &cfd, &val) < 0)) {
 		perror("Failed to read socket storage");
@@ -59,6 +109,12 @@ static int verify_msk(int map_fd, int client_fd)
 		err++;
 	}
 
+	if (val.token != token) {
+		log_err("Unexpected mptcp_sock.token %x != %x",
+			val.token, token);
+		err++;
+	}
+
 	return err;
 }
 
@@ -124,6 +180,7 @@ static int run_test(int cgroup_fd, int server_fd, bool is_mptcp)
 
 void test_base(void)
 {
+	char cmd[256], tmp_dir[] = "/tmp/XXXXXX";
 	int server_fd, cgroup_fd;
 
 	cgroup_fd = test__join_cgroup("/mptcp");
@@ -141,6 +198,13 @@ void test_base(void)
 
 with_mptcp:
 	/* with MPTCP */
+	if (CHECK_FAIL(!mkdtemp(tmp_dir)))
+		goto close_cgroup_fd;
+	snprintf(monitor_log_path, sizeof(monitor_log_path),
+		 "%s/ip_mptcp_monitor", tmp_dir);
+	snprintf(cmd, sizeof(cmd), "ip mptcp monitor > %s &", monitor_log_path);
+	if (CHECK_FAIL(system(cmd)))
+		goto close_cgroup_fd;
 	server_fd = start_mptcp_server(AF_INET, NULL, 0, 0);
 	if (CHECK_FAIL(server_fd < 0))
 		goto close_cgroup_fd;
@@ -148,6 +212,8 @@ void test_base(void)
 	CHECK_FAIL(run_test(cgroup_fd, server_fd, true));
 
 	close(server_fd);
+	snprintf(cmd, sizeof(cmd), "rm -rf %s", tmp_dir);
+	system(cmd);
 
 close_cgroup_fd:
 	close(cgroup_fd);
diff --git a/tools/testing/selftests/bpf/progs/mptcp_sock.c b/tools/testing/selftests/bpf/progs/mptcp_sock.c
index 5cfaec4e7245..56cc7c61131b 100644
--- a/tools/testing/selftests/bpf/progs/mptcp_sock.c
+++ b/tools/testing/selftests/bpf/progs/mptcp_sock.c
@@ -11,6 +11,7 @@ __u32 _version SEC("version") = 1;
 struct mptcp_storage {
 	__u32 invoked;
 	__u32 is_mptcp;
+	__u32 token;
 };
 
 struct {
@@ -45,6 +46,8 @@ int _sockops(struct bpf_sock_ops *ctx)
 					     BPF_SK_STORAGE_GET_F_CREATE);
 		if (!storage)
 			return 1;
+
+		storage->token = 0;
 	} else {
 		msk = bpf_skc_to_mptcp_sock(sk);
 		if (!msk)
@@ -54,6 +57,8 @@ int _sockops(struct bpf_sock_ops *ctx)
 					     BPF_SK_STORAGE_GET_F_CREATE);
 		if (!storage)
 			return 1;
+
+		storage->token = msk->token;
 	}
 	storage->invoked++;
 	storage->is_mptcp = tcp_sk->is_mptcp;
-- 
2.36.0



* [PATCH bpf-next 4/7] selftests: bpf: test bpf_skc_to_mptcp_sock
From: Mat Martineau @ 2022-04-20 22:24 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Geliang Tang, ast, daniel, andrii, mptcp, Matthieu Baerts,
	Mat Martineau
In-Reply-To: <20220420222459.307649-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch extends the MPTCP test base to test the new helper
bpf_skc_to_mptcp_sock().

Define struct mptcp_sock in bpf_mptcp_helpers.h, use
bpf_skc_to_mptcp_sock() to get the msk socket in progs/mptcp_sock.c, and
store the information in socket_storage_map.

Read the information back from socket_storage_map in prog_tests/mptcp.c.
Add a new function verify_msk() to verify the MPTCP socket, and rename
verify_sk() to verify_tsk(), which now verifies plain TCP sockets only.
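
The split into verify_tsk() and verify_msk() comes down to checking the
same two stored fields against different expected values. A minimal
userspace sketch of that check, assuming a hypothetical helper
count_storage_errors() that is not part of this patch:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the two fields the sockops program stores in this patch. */
struct mptcp_storage {
	uint32_t invoked;
	uint32_t is_mptcp;
};

/*
 * Hypothetical helper (not in the patch): return how many fields differ
 * from what the test expects. The program should have run exactly once,
 * and is_mptcp should be 0 for plain TCP and 1 for an MPTCP subflow.
 */
int count_storage_errors(const struct mptcp_storage *val,
			 uint32_t expected_is_mptcp)
{
	int err = 0;

	if (val->invoked != 1)
		err++;
	if (val->is_mptcp != expected_is_mptcp)
		err++;

	return err;
}
```

Passing 0 as expected_is_mptcp corresponds to the verify_tsk() case, and
passing 1 to the verify_msk() case.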

Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 MAINTAINERS                                   |  1 +
 .../testing/selftests/bpf/bpf_mptcp_helpers.h | 14 ++++++++
 .../testing/selftests/bpf/prog_tests/mptcp.c  | 36 +++++++++++++++----
 .../testing/selftests/bpf/progs/mptcp_sock.c  | 22 +++++++++---
 4 files changed, 62 insertions(+), 11 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/bpf_mptcp_helpers.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 01fbdb0e0180..f31fa73266f9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13764,6 +13764,7 @@ F:	include/net/mptcp.h
 F:	include/trace/events/mptcp.h
 F:	include/uapi/linux/mptcp.h
 F:	net/mptcp/
+F:	tools/testing/selftests/bpf/bpf_mptcp_helpers.h
 F:	tools/testing/selftests/bpf/*/*mptcp*.c
 F:	tools/testing/selftests/net/mptcp/
 
diff --git a/tools/testing/selftests/bpf/bpf_mptcp_helpers.h b/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
new file mode 100644
index 000000000000..18da4cc65e89
--- /dev/null
+++ b/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2022, SUSE. */
+
+#ifndef __BPF_MPTCP_HELPERS_H
+#define __BPF_MPTCP_HELPERS_H
+
+#include "bpf_tcp_helpers.h"
+
+struct mptcp_sock {
+	struct inet_connection_sock	sk;
+
+} __attribute__((preserve_access_index));
+
+#endif
diff --git a/tools/testing/selftests/bpf/prog_tests/mptcp.c b/tools/testing/selftests/bpf/prog_tests/mptcp.c
index cd548bb2828f..4b40bbdaf91f 100644
--- a/tools/testing/selftests/bpf/prog_tests/mptcp.c
+++ b/tools/testing/selftests/bpf/prog_tests/mptcp.c
@@ -10,14 +10,12 @@ struct mptcp_storage {
 	__u32 is_mptcp;
 };
 
-static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 is_mptcp)
+static int verify_tsk(int map_fd, int client_fd)
 {
+	char *msg = "plain TCP socket";
 	int err = 0, cfd = client_fd;
 	struct mptcp_storage val;
 
-	if (is_mptcp == 1)
-		return 0;
-
 	if (CHECK_FAIL(bpf_map_lookup_elem(map_fd, &cfd, &val) < 0)) {
 		perror("Failed to read socket storage");
 		return -1;
@@ -38,6 +36,32 @@ static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 is_mptcp)
 	return err;
 }
 
+static int verify_msk(int map_fd, int client_fd)
+{
+	char *msg = "MPTCP subflow socket";
+	int err = 0, cfd = client_fd;
+	struct mptcp_storage val;
+
+	if (CHECK_FAIL(bpf_map_lookup_elem(map_fd, &cfd, &val) < 0)) {
+		perror("Failed to read socket storage");
+		return -1;
+	}
+
+	if (val.invoked != 1) {
+		log_err("%s: unexpected invoked count %d != 1",
+			msg, val.invoked);
+		err++;
+	}
+
+	if (val.is_mptcp != 1) {
+		log_err("%s: unexpected bpf_tcp_sock.is_mptcp %d != 1",
+			msg, val.is_mptcp);
+		err++;
+	}
+
+	return err;
+}
+
 static int run_test(int cgroup_fd, int server_fd, bool is_mptcp)
 {
 	int client_fd, prog_fd, map_fd, err;
@@ -88,8 +112,8 @@ static int run_test(int cgroup_fd, int server_fd, bool is_mptcp)
 		goto out;
 	}
 
-	err += is_mptcp ? verify_sk(map_fd, client_fd, "MPTCP subflow socket", 1) :
-			  verify_sk(map_fd, client_fd, "plain TCP socket", 0);
+	err += is_mptcp ? verify_msk(map_fd, client_fd) :
+			  verify_tsk(map_fd, client_fd);
 
 	close(client_fd);
 
diff --git a/tools/testing/selftests/bpf/progs/mptcp_sock.c b/tools/testing/selftests/bpf/progs/mptcp_sock.c
index 0d65fb889d03..5cfaec4e7245 100644
--- a/tools/testing/selftests/bpf/progs/mptcp_sock.c
+++ b/tools/testing/selftests/bpf/progs/mptcp_sock.c
@@ -3,6 +3,7 @@
 
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>
+#include "bpf_mptcp_helpers.h"
 
 char _license[] SEC("license") = "GPL";
 __u32 _version SEC("version") = 1;
@@ -24,6 +25,7 @@ int _sockops(struct bpf_sock_ops *ctx)
 {
 	struct mptcp_storage *storage;
 	struct bpf_tcp_sock *tcp_sk;
+	struct mptcp_sock *msk;
 	int op = (int)ctx->op;
 	struct bpf_sock *sk;
 
@@ -38,11 +40,21 @@ int _sockops(struct bpf_sock_ops *ctx)
 	if (!tcp_sk)
 		return 1;
 
-	storage = bpf_sk_storage_get(&socket_storage_map, sk, 0,
-				     BPF_SK_STORAGE_GET_F_CREATE);
-	if (!storage)
-		return 1;
-
+	if (!tcp_sk->is_mptcp) {
+		storage = bpf_sk_storage_get(&socket_storage_map, sk, 0,
+					     BPF_SK_STORAGE_GET_F_CREATE);
+		if (!storage)
+			return 1;
+	} else {
+		msk = bpf_skc_to_mptcp_sock(sk);
+		if (!msk)
+			return 1;
+
+		storage = bpf_sk_storage_get(&socket_storage_map, msk, 0,
+					     BPF_SK_STORAGE_GET_F_CREATE);
+		if (!storage)
+			return 1;
+	}
 	storage->invoked++;
 	storage->is_mptcp = tcp_sk->is_mptcp;
 
-- 
2.36.0



* [PATCH bpf-next 6/7] selftests: bpf: verify ca_name of struct mptcp_sock
From: Mat Martineau @ 2022-04-20 22:24 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Geliang Tang, ast, daniel, andrii, mptcp, Matthieu Baerts,
	Mat Martineau
In-Reply-To: <20220420222459.307649-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch verifies another member of struct mptcp_sock, ca_name. Add a
new function get_msk_ca_name() to read the sysctl tcp_congestion_control
and verify it in verify_msk().
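
The comparison itself is a bounded string match over TCP_CA_NAME_MAX
bytes. A small sketch of one way to do it, assuming a hypothetical
helper ca_name_equal() (not in this patch) that also tolerates a
trailing newline, which would matter if the sysctl value were read
without the -b flag:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifndef TCP_CA_NAME_MAX
#define TCP_CA_NAME_MAX	16
#endif

/* Length of a name, stopping at NUL, newline, or the size limit. */
static size_t ca_name_len(const char *s)
{
	size_t n = 0;

	while (n < TCP_CA_NAME_MAX && s[n] != '\0' && s[n] != '\n')
		n++;

	return n;
}

/* Hypothetical helper: nonzero if both buffers hold the same CA name. */
int ca_name_equal(const char *a, const char *b)
{
	size_t la = ca_name_len(a);
	size_t lb = ca_name_len(b);

	return la == lb && memcmp(a, b, la) == 0;
}
```

The patch below uses a plain strncmp() instead, which is sufficient as
long as neither buffer carries a trailing newline.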

Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 .../testing/selftests/bpf/bpf_mptcp_helpers.h |  1 +
 tools/testing/selftests/bpf/bpf_tcp_helpers.h |  4 ++++
 .../testing/selftests/bpf/prog_tests/mptcp.c  | 24 +++++++++++++++++++
 .../testing/selftests/bpf/progs/mptcp_sock.c  |  4 ++++
 4 files changed, 33 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_mptcp_helpers.h b/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
index 87e15810997d..463e4e061c96 100644
--- a/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_mptcp_helpers.h
@@ -10,6 +10,7 @@ struct mptcp_sock {
 	struct inet_connection_sock	sk;
 
 	__u32		token;
+	char		ca_name[TCP_CA_NAME_MAX];
 } __attribute__((preserve_access_index));
 
 #endif
diff --git a/tools/testing/selftests/bpf/bpf_tcp_helpers.h b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
index b1ede6f0b821..89750d732cfa 100644
--- a/tools/testing/selftests/bpf/bpf_tcp_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_tcp_helpers.h
@@ -16,6 +16,10 @@ BPF_PROG(name, args)
 #define SOL_TCP 6
 #endif
 
+#ifndef TCP_CA_NAME_MAX
+#define TCP_CA_NAME_MAX	16
+#endif
+
 #define tcp_jiffies32 ((__u32)bpf_jiffies64())
 
 struct sock_common {
diff --git a/tools/testing/selftests/bpf/prog_tests/mptcp.c b/tools/testing/selftests/bpf/prog_tests/mptcp.c
index c5d96ba81e04..4518aa6e661e 100644
--- a/tools/testing/selftests/bpf/prog_tests/mptcp.c
+++ b/tools/testing/selftests/bpf/prog_tests/mptcp.c
@@ -5,10 +5,15 @@
 #include "cgroup_helpers.h"
 #include "network_helpers.h"
 
+#ifndef TCP_CA_NAME_MAX
+#define TCP_CA_NAME_MAX	16
+#endif
+
 struct mptcp_storage {
 	__u32 invoked;
 	__u32 is_mptcp;
 	__u32 token;
+	char ca_name[TCP_CA_NAME_MAX];
 };
 
 static char monitor_log_path[64];
@@ -79,11 +84,22 @@ static __u32 get_msk_token(void)
 	return token;
 }
 
+static void get_msk_ca_name(char ca_name[])
+{
+	FILE *stream = popen("sysctl -b net.ipv4.tcp_congestion_control", "r");
+
+	if (!stream || !fgets(ca_name, TCP_CA_NAME_MAX, stream))
+		log_err("Failed to read ca_name");
+	if (stream)
+		pclose(stream);
+}
+
 static int verify_msk(int map_fd, int client_fd)
 {
 	char *msg = "MPTCP subflow socket";
 	int err = 0, cfd = client_fd;
 	struct mptcp_storage val;
+	char ca_name[TCP_CA_NAME_MAX];
 	__u32 token;
 
 	token = get_msk_token();
@@ -92,6 +108,8 @@ static int verify_msk(int map_fd, int client_fd)
 		return -1;
 	}
 
+	get_msk_ca_name(ca_name);
+
 	if (CHECK_FAIL(bpf_map_lookup_elem(map_fd, &cfd, &val) < 0)) {
 		perror("Failed to read socket storage");
 		return -1;
@@ -115,6 +133,12 @@ static int verify_msk(int map_fd, int client_fd)
 		err++;
 	}
 
+	if (strncmp(val.ca_name, ca_name, TCP_CA_NAME_MAX)) {
+		log_err("Unexpected mptcp_sock.ca_name %s != %s",
+			val.ca_name, ca_name);
+		err++;
+	}
+
 	return err;
 }
 
diff --git a/tools/testing/selftests/bpf/progs/mptcp_sock.c b/tools/testing/selftests/bpf/progs/mptcp_sock.c
index 56cc7c61131b..53bf67580010 100644
--- a/tools/testing/selftests/bpf/progs/mptcp_sock.c
+++ b/tools/testing/selftests/bpf/progs/mptcp_sock.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright (c) 2020, Tessares SA. */
 
+#include <string.h>
 #include <linux/bpf.h>
 #include <bpf/bpf_helpers.h>
 #include "bpf_mptcp_helpers.h"
@@ -12,6 +13,7 @@ struct mptcp_storage {
 	__u32 invoked;
 	__u32 is_mptcp;
 	__u32 token;
+	char ca_name[TCP_CA_NAME_MAX];
 };
 
 struct {
@@ -48,6 +50,7 @@ int _sockops(struct bpf_sock_ops *ctx)
 			return 1;
 
 		storage->token = 0;
+		bzero(storage->ca_name, TCP_CA_NAME_MAX);
 	} else {
 		msk = bpf_skc_to_mptcp_sock(sk);
 		if (!msk)
@@ -59,6 +62,7 @@ int _sockops(struct bpf_sock_ops *ctx)
 			return 1;
 
 		storage->token = msk->token;
+		memcpy(storage->ca_name, msk->ca_name, TCP_CA_NAME_MAX);
 	}
 	storage->invoked++;
 	storage->is_mptcp = tcp_sk->is_mptcp;
-- 
2.36.0



* [PATCH bpf-next 3/7] selftests: bpf: add MPTCP test base
From: Mat Martineau @ 2022-04-20 22:24 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Nicolas Rybowski, ast, daniel, andrii, mptcp, Matthieu Baerts,
	Geliang Tang, Mat Martineau
In-Reply-To: <20220420222459.307649-1-mathew.j.martineau@linux.intel.com>

From: Nicolas Rybowski <nicolas.rybowski@tessares.net>

This patch adds a base for MPTCP-specific tests.

It is currently limited to checking the is_mptcp field for a plain TCP
connection, because there is no easy way to get the subflow sk from a msk
in userspace. This implies that we cannot look up the sk_storage attached
to the subflow sk in the sockops program.

Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 MAINTAINERS                                   |   1 +
 tools/testing/selftests/bpf/config            |   1 +
 tools/testing/selftests/bpf/network_helpers.c |  43 ++++--
 tools/testing/selftests/bpf/network_helpers.h |   4 +
 .../testing/selftests/bpf/prog_tests/mptcp.c  | 136 ++++++++++++++++++
 .../testing/selftests/bpf/progs/mptcp_sock.c  |  50 +++++++
 6 files changed, 227 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/mptcp.c
 create mode 100644 tools/testing/selftests/bpf/progs/mptcp_sock.c

diff --git a/MAINTAINERS b/MAINTAINERS
index fd768d43e048..01fbdb0e0180 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13764,6 +13764,7 @@ F:	include/net/mptcp.h
 F:	include/trace/events/mptcp.h
 F:	include/uapi/linux/mptcp.h
 F:	net/mptcp/
+F:	tools/testing/selftests/bpf/*/*mptcp*.c
 F:	tools/testing/selftests/net/mptcp/
 
 NETWORKING [TCP]
diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 763db63a3890..fe0d3ad6ecd8 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -53,3 +53,4 @@ CONFIG_NF_DEFRAG_IPV4=y
 CONFIG_NF_DEFRAG_IPV6=y
 CONFIG_NF_CONNTRACK=y
 CONFIG_USERFAULTFD=y
+CONFIG_MPTCP=y
diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index 2bb1f9b3841d..c9a2e39e34fc 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -21,6 +21,10 @@
 #include "network_helpers.h"
 #include "test_progs.h"
 
+#ifndef IPPROTO_MPTCP
+#define IPPROTO_MPTCP 262
+#endif
+
 #define clean_errno() (errno == 0 ? "None" : strerror(errno))
 #define log_err(MSG, ...) ({						\
 			int __save = errno;				\
@@ -73,13 +77,13 @@ int settimeo(int fd, int timeout_ms)
 
 #define save_errno_close(fd) ({ int __save = errno; close(fd); errno = __save; })
 
-static int __start_server(int type, const struct sockaddr *addr,
+static int __start_server(int type, int protocol, const struct sockaddr *addr,
 			  socklen_t addrlen, int timeout_ms, bool reuseport)
 {
 	int on = 1;
 	int fd;
 
-	fd = socket(addr->sa_family, type, 0);
+	fd = socket(addr->sa_family, type, protocol);
 	if (fd < 0) {
 		log_err("Failed to create server socket");
 		return -1;
@@ -113,8 +117,8 @@ static int __start_server(int type, const struct sockaddr *addr,
 	return -1;
 }
 
-int start_server(int family, int type, const char *addr_str, __u16 port,
-		 int timeout_ms)
+static int start_server_proto(int family, int type, int protocol,
+			      const char *addr_str, __u16 port, int timeout_ms)
 {
 	struct sockaddr_storage addr;
 	socklen_t addrlen;
@@ -122,10 +126,23 @@ int start_server(int family, int type, const char *addr_str, __u16 port,
 	if (make_sockaddr(family, addr_str, port, &addr, &addrlen))
 		return -1;
 
-	return __start_server(type, (struct sockaddr *)&addr,
+	return __start_server(type, protocol, (struct sockaddr *)&addr,
 			      addrlen, timeout_ms, false);
 }
 
+int start_server(int family, int type, const char *addr_str, __u16 port,
+		 int timeout_ms)
+{
+	return start_server_proto(family, type, 0, addr_str, port, timeout_ms);
+}
+
+int start_mptcp_server(int family, const char *addr_str, __u16 port,
+		       int timeout_ms)
+{
+	return start_server_proto(family, SOCK_STREAM, IPPROTO_MPTCP, addr_str,
+				  port, timeout_ms);
+}
+
 int *start_reuseport_server(int family, int type, const char *addr_str,
 			    __u16 port, int timeout_ms, unsigned int nr_listens)
 {
@@ -144,7 +161,7 @@ int *start_reuseport_server(int family, int type, const char *addr_str,
 	if (!fds)
 		return NULL;
 
-	fds[0] = __start_server(type, (struct sockaddr *)&addr, addrlen,
+	fds[0] = __start_server(type, 0, (struct sockaddr *)&addr, addrlen,
 				timeout_ms, true);
 	if (fds[0] == -1)
 		goto close_fds;
@@ -154,7 +171,7 @@ int *start_reuseport_server(int family, int type, const char *addr_str,
 		goto close_fds;
 
 	for (; nr_fds < nr_listens; nr_fds++) {
-		fds[nr_fds] = __start_server(type, (struct sockaddr *)&addr,
+		fds[nr_fds] = __start_server(type, 0, (struct sockaddr *)&addr,
 					     addrlen, timeout_ms, true);
 		if (fds[nr_fds] == -1)
 			goto close_fds;
@@ -265,7 +282,7 @@ int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts)
 	}
 
 	addr_in = (struct sockaddr_in *)&addr;
-	fd = socket(addr_in->sin_family, type, 0);
+	fd = socket(addr_in->sin_family, type, opts->protocol);
 	if (fd < 0) {
 		log_err("Failed to create client socket");
 		return -1;
@@ -298,6 +315,16 @@ int connect_to_fd(int server_fd, int timeout_ms)
 	return connect_to_fd_opts(server_fd, &opts);
 }
 
+int connect_to_mptcp_fd(int server_fd, int timeout_ms)
+{
+	struct network_helper_opts opts = {
+		.timeout_ms = timeout_ms,
+		.protocol = IPPROTO_MPTCP,
+	};
+
+	return connect_to_fd_opts(server_fd, &opts);
+}
+
 int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms)
 {
 	struct sockaddr_storage addr;
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index a4b3b2f9877b..e0feb115b2ae 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -21,6 +21,7 @@ struct network_helper_opts {
 	const char *cc;
 	int timeout_ms;
 	bool must_fail;
+	int protocol;
 };
 
 /* ipv4 test vector */
@@ -42,11 +43,14 @@ extern struct ipv6_packet pkt_v6;
 int settimeo(int fd, int timeout_ms);
 int start_server(int family, int type, const char *addr, __u16 port,
 		 int timeout_ms);
+int start_mptcp_server(int family, const char *addr, __u16 port,
+		       int timeout_ms);
 int *start_reuseport_server(int family, int type, const char *addr_str,
 			    __u16 port, int timeout_ms,
 			    unsigned int nr_listens);
 void free_fds(int *fds, unsigned int nr_close_fds);
 int connect_to_fd(int server_fd, int timeout_ms);
+int connect_to_mptcp_fd(int server_fd, int timeout_ms);
 int connect_to_fd_opts(int server_fd, const struct network_helper_opts *opts);
 int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms);
 int fastopen_connect(int server_fd, const char *data, unsigned int data_len,
diff --git a/tools/testing/selftests/bpf/prog_tests/mptcp.c b/tools/testing/selftests/bpf/prog_tests/mptcp.c
new file mode 100644
index 000000000000..cd548bb2828f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/mptcp.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020, Tessares SA. */
+
+#include <test_progs.h>
+#include "cgroup_helpers.h"
+#include "network_helpers.h"
+
+struct mptcp_storage {
+	__u32 invoked;
+	__u32 is_mptcp;
+};
+
+static int verify_sk(int map_fd, int client_fd, const char *msg, __u32 is_mptcp)
+{
+	int err = 0, cfd = client_fd;
+	struct mptcp_storage val;
+
+	if (is_mptcp == 1)
+		return 0;
+
+	if (CHECK_FAIL(bpf_map_lookup_elem(map_fd, &cfd, &val) < 0)) {
+		perror("Failed to read socket storage");
+		return -1;
+	}
+
+	if (val.invoked != 1) {
+		log_err("%s: unexpected invoked count %d != 1",
+			msg, val.invoked);
+		err++;
+	}
+
+	if (val.is_mptcp != 0) {
+		log_err("%s: unexpected bpf_tcp_sock.is_mptcp %d != 0",
+			msg, val.is_mptcp);
+		err++;
+	}
+
+	return err;
+}
+
+static int run_test(int cgroup_fd, int server_fd, bool is_mptcp)
+{
+	int client_fd, prog_fd, map_fd, err;
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	struct bpf_map *map;
+
+	obj = bpf_object__open("./mptcp_sock.o");
+	if (libbpf_get_error(obj))
+		return -EIO;
+
+	err = bpf_object__load(obj);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(obj, "_sockops");
+	if (CHECK_FAIL(!prog)) {
+		err = -EIO;
+		goto out;
+	}
+
+	prog_fd = bpf_program__fd(prog);
+	if (CHECK_FAIL(prog_fd < 0)) {
+		err = -EIO;
+		goto out;
+	}
+
+	map = bpf_object__find_map_by_name(obj, "socket_storage_map");
+	if (CHECK_FAIL(!map)) {
+		err = -EIO;
+		goto out;
+	}
+
+	map_fd = bpf_map__fd(map);
+	if (CHECK_FAIL(map_fd < 0)) {
+		err = -EIO;
+		goto out;
+	}
+
+	err = bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_SOCK_OPS, 0);
+	if (CHECK_FAIL(err))
+		goto out;
+
+	client_fd = is_mptcp ? connect_to_mptcp_fd(server_fd, 0) :
+			       connect_to_fd(server_fd, 0);
+	if (client_fd < 0) {
+		err = -EIO;
+		goto out;
+	}
+
+	err += is_mptcp ? verify_sk(map_fd, client_fd, "MPTCP subflow socket", 1) :
+			  verify_sk(map_fd, client_fd, "plain TCP socket", 0);
+
+	close(client_fd);
+
+out:
+	bpf_object__close(obj);
+	return err;
+}
+
+void test_base(void)
+{
+	int server_fd, cgroup_fd;
+
+	cgroup_fd = test__join_cgroup("/mptcp");
+	if (CHECK_FAIL(cgroup_fd < 0))
+		return;
+
+	/* without MPTCP */
+	server_fd = start_server(AF_INET, SOCK_STREAM, NULL, 0, 0);
+	if (CHECK_FAIL(server_fd < 0))
+		goto with_mptcp;
+
+	CHECK_FAIL(run_test(cgroup_fd, server_fd, false));
+
+	close(server_fd);
+
+with_mptcp:
+	/* with MPTCP */
+	server_fd = start_mptcp_server(AF_INET, NULL, 0, 0);
+	if (CHECK_FAIL(server_fd < 0))
+		goto close_cgroup_fd;
+
+	CHECK_FAIL(run_test(cgroup_fd, server_fd, true));
+
+	close(server_fd);
+
+close_cgroup_fd:
+	close(cgroup_fd);
+}
+
+void test_mptcp(void)
+{
+	if (test__start_subtest("base"))
+		test_base();
+}
diff --git a/tools/testing/selftests/bpf/progs/mptcp_sock.c b/tools/testing/selftests/bpf/progs/mptcp_sock.c
new file mode 100644
index 000000000000..0d65fb889d03
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/mptcp_sock.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2020, Tessares SA. */
+
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1;
+
+struct mptcp_storage {
+	__u32 invoked;
+	__u32 is_mptcp;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct mptcp_storage);
+} socket_storage_map SEC(".maps");
+
+SEC("sockops")
+int _sockops(struct bpf_sock_ops *ctx)
+{
+	struct mptcp_storage *storage;
+	struct bpf_tcp_sock *tcp_sk;
+	int op = (int)ctx->op;
+	struct bpf_sock *sk;
+
+	if (op != BPF_SOCK_OPS_TCP_CONNECT_CB)
+		return 1;
+
+	sk = ctx->sk;
+	if (!sk)
+		return 1;
+
+	tcp_sk = bpf_tcp_sock(sk);
+	if (!tcp_sk)
+		return 1;
+
+	storage = bpf_sk_storage_get(&socket_storage_map, sk, 0,
+				     BPF_SK_STORAGE_GET_F_CREATE);
+	if (!storage)
+		return 1;
+
+	storage->invoked++;
+	storage->is_mptcp = tcp_sk->is_mptcp;
+
+	return 1;
+}
-- 
2.36.0



* [PATCH bpf-next 2/7] bpf: add bpf_skc_to_mptcp_sock_proto
From: Mat Martineau @ 2022-04-20 22:24 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Geliang Tang, ast, daniel, andrii, mptcp, Nicolas Rybowski,
	Matthieu Baerts, Mat Martineau
In-Reply-To: <20220420222459.307649-1-mathew.j.martineau@linux.intel.com>

From: Geliang Tang <geliang.tang@suse.com>

This patch implements a new struct bpf_func_proto, named
bpf_skc_to_mptcp_sock_proto. Define a new bpf_id BTF_SOCK_TYPE_MPTCP,
and a new helper bpf_skc_to_mptcp_sock(), which invokes a new kernel
function bpf_mptcp_sock_from_subflow() in net/mptcp/bpf.c to get struct
mptcp_sock from a given subflow socket.
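
The cast is a guarded lookup rather than a pointer reinterpretation: it
only succeeds for a full TCP socket that is flagged as an MPTCP subflow.
A userspace model of that guard, using hypothetical stand-in types
(fake_sock, fake_msk) in place of the kernel structures:

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_IPPROTO_TCP 6	/* same value as IPPROTO_TCP */

/* Stand-in for struct mptcp_sock. */
struct fake_msk {
	unsigned int token;
};

/* Stand-in for struct sock; each field models one kernel-side check. */
struct fake_sock {
	int fullsock;		/* models sk_fullsock(sk) */
	int protocol;		/* models sk->sk_protocol */
	int is_mptcp;		/* models sk_is_mptcp(sk) */
	struct fake_msk *conn;	/* models mptcp_subflow_ctx(sk)->conn */
};

/* Return the owning MPTCP socket, or NULL if sk is not a subflow. */
struct fake_msk *fake_msk_from_subflow(const struct fake_sock *sk)
{
	if (sk && sk->fullsock && sk->protocol == FAKE_IPPROTO_TCP &&
	    sk->is_mptcp)
		return sk->conn;

	return NULL;
}
```

Because a NULL result is a normal outcome, the helper is declared with
RET_PTR_TO_BTF_ID_OR_NULL and BPF callers must check the result before
dereferencing it.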

Co-developed-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/linux/btf_ids.h        |  3 ++-
 include/net/mptcp.h            |  6 ++++++
 include/uapi/linux/bpf.h       |  7 +++++++
 net/core/filter.c              | 17 +++++++++++++++++
 net/mptcp/Makefile             |  4 ++++
 net/mptcp/bpf.c                | 22 ++++++++++++++++++++++
 scripts/bpf_doc.py             |  2 ++
 tools/include/uapi/linux/bpf.h |  7 +++++++
 8 files changed, 67 insertions(+), 1 deletion(-)
 create mode 100644 net/mptcp/bpf.c

diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
index bc5d9cc34e4c..335a19092368 100644
--- a/include/linux/btf_ids.h
+++ b/include/linux/btf_ids.h
@@ -178,7 +178,8 @@ extern struct btf_id_set name;
 	BTF_SOCK_TYPE(BTF_SOCK_TYPE_TCP6, tcp6_sock)			\
 	BTF_SOCK_TYPE(BTF_SOCK_TYPE_UDP, udp_sock)			\
 	BTF_SOCK_TYPE(BTF_SOCK_TYPE_UDP6, udp6_sock)			\
-	BTF_SOCK_TYPE(BTF_SOCK_TYPE_UNIX, unix_sock)
+	BTF_SOCK_TYPE(BTF_SOCK_TYPE_UNIX, unix_sock)			\
+	BTF_SOCK_TYPE(BTF_SOCK_TYPE_MPTCP, mptcp_sock)
 
 enum {
 #define BTF_SOCK_TYPE(name, str) name,
diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index 0a3b0fb04a3b..5b3a6f783182 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -283,4 +283,10 @@ static inline int mptcpv6_init(void) { return 0; }
 static inline void mptcpv6_handle_mapped(struct sock *sk, bool mapped) { }
 #endif
 
+#if defined(CONFIG_MPTCP) && defined(CONFIG_BPF_JIT) && defined(CONFIG_BPF_SYSCALL)
+struct mptcp_sock *bpf_mptcp_sock_from_subflow(struct sock *sk);
+#else
+static inline struct mptcp_sock *bpf_mptcp_sock_from_subflow(struct sock *sk) { return NULL; }
+#endif
+
 #endif /* __NET_MPTCP_H */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 9ef1f3e1c22f..785f2cb15495 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5143,6 +5143,12 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * struct mptcp_sock *bpf_skc_to_mptcp_sock(void *sk)
+ *	Description
+ *		Dynamically cast a *sk* pointer to a *mptcp_sock* pointer.
+ *	Return
+ *		*sk* if casting is valid, or **NULL** otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5339,6 +5345,7 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(skc_to_mptcp_sock),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index 7b1867f1f422..4081c55f6f78 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,7 @@
 #include <linux/btf_ids.h>
 #include <net/tls.h>
 #include <net/xdp.h>
+#include <net/mptcp.h>
 
 static const struct bpf_func_proto *
 bpf_sk_base_func_proto(enum bpf_func_id func_id);
@@ -11279,6 +11280,19 @@ const struct bpf_func_proto bpf_skc_to_unix_sock_proto = {
 	.ret_btf_id		= &btf_sock_ids[BTF_SOCK_TYPE_UNIX],
 };
 
+BPF_CALL_1(bpf_skc_to_mptcp_sock, struct sock *, sk)
+{
+	return (unsigned long)bpf_mptcp_sock_from_subflow(sk);
+}
+
+static const struct bpf_func_proto bpf_skc_to_mptcp_sock_proto = {
+	.func		= bpf_skc_to_mptcp_sock,
+	.gpl_only	= false,
+	.ret_type	= RET_PTR_TO_BTF_ID_OR_NULL,
+	.arg1_type	= ARG_PTR_TO_SOCK_COMMON,
+	.ret_btf_id	= &btf_sock_ids[BTF_SOCK_TYPE_MPTCP],
+};
+
 BPF_CALL_1(bpf_sock_from_file, struct file *, file)
 {
 	return (unsigned long)sock_from_file(file);
@@ -11321,6 +11335,9 @@ bpf_sk_base_func_proto(enum bpf_func_id func_id)
 	case BPF_FUNC_skc_to_unix_sock:
 		func = &bpf_skc_to_unix_sock_proto;
 		break;
+	case BPF_FUNC_skc_to_mptcp_sock:
+		func = &bpf_skc_to_mptcp_sock_proto;
+		break;
 	case BPF_FUNC_ktime_get_coarse_ns:
 		return &bpf_ktime_get_coarse_ns_proto;
 	default:
diff --git a/net/mptcp/Makefile b/net/mptcp/Makefile
index e54daceac58b..168c55d1c917 100644
--- a/net/mptcp/Makefile
+++ b/net/mptcp/Makefile
@@ -10,3 +10,7 @@ obj-$(CONFIG_INET_MPTCP_DIAG) += mptcp_diag.o
 mptcp_crypto_test-objs := crypto_test.o
 mptcp_token_test-objs := token_test.o
 obj-$(CONFIG_MPTCP_KUNIT_TEST) += mptcp_crypto_test.o mptcp_token_test.o
+
+ifeq ($(CONFIG_BPF_JIT),y)
+obj-$(CONFIG_BPF_SYSCALL) += bpf.o
+endif
diff --git a/net/mptcp/bpf.c b/net/mptcp/bpf.c
new file mode 100644
index 000000000000..535602ba2582
--- /dev/null
+++ b/net/mptcp/bpf.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Multipath TCP
+ *
+ * Copyright (c) 2020, Tessares SA.
+ * Copyright (c) 2022, SUSE.
+ *
+ * Author: Nicolas Rybowski <nicolas.rybowski@tessares.net>
+ */
+
+#define pr_fmt(fmt) "MPTCP: " fmt
+
+#include <linux/bpf.h>
+#include "protocol.h"
+
+struct mptcp_sock *bpf_mptcp_sock_from_subflow(struct sock *sk)
+{
+	if (sk && sk_fullsock(sk) && sk->sk_protocol == IPPROTO_TCP && sk_is_mptcp(sk))
+		return mptcp_sk(mptcp_subflow_ctx(sk)->conn);
+
+	return NULL;
+}
+EXPORT_SYMBOL(bpf_mptcp_sock_from_subflow);
diff --git a/scripts/bpf_doc.py b/scripts/bpf_doc.py
index 096625242475..d5452f7eb996 100755
--- a/scripts/bpf_doc.py
+++ b/scripts/bpf_doc.py
@@ -633,6 +633,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct mptcp_sock',
     ]
     known_types = {
             '...',
@@ -682,6 +683,7 @@ class PrinterHelpers(Printer):
             'struct socket',
             'struct file',
             'struct bpf_timer',
+            'struct mptcp_sock',
     }
     mapped_types = {
             'u8': '__u8',
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9ef1f3e1c22f..785f2cb15495 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5143,6 +5143,12 @@ union bpf_attr {
  *		The **hash_algo** is returned on success,
  *		**-EOPNOTSUP** if the hash calculation failed or **-EINVAL** if
  *		invalid arguments are passed.
+ *
+ * struct mptcp_sock *bpf_skc_to_mptcp_sock(void *sk)
+ *	Description
+ *		Dynamically cast a *sk* pointer to a *mptcp_sock* pointer.
+ *	Return
+ *		*sk* if casting is valid, or **NULL** otherwise.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5339,6 +5345,7 @@ union bpf_attr {
 	FN(copy_from_user_task),	\
 	FN(skb_set_tstamp),		\
 	FN(ima_file_hash),		\
+	FN(skc_to_mptcp_sock),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.36.0



* [PATCH bpf-next 1/7] bpf: expose is_mptcp flag to bpf_tcp_sock
From: Mat Martineau @ 2022-04-20 22:24 UTC (permalink / raw)
  To: netdev, bpf
  Cc: Nicolas Rybowski, ast, daniel, andrii, mptcp, Matthieu Baerts,
	Mat Martineau
In-Reply-To: <20220420222459.307649-1-mathew.j.martineau@linux.intel.com>

From: Nicolas Rybowski <nicolas.rybowski@tessares.net>

is_mptcp is a field from struct tcp_sock used to indicate that the
current tcp_sock is part of the MPTCP protocol.

In this protocol, a first socket (mptcp_sock) is created with
sk_protocol set to IPPROTO_MPTCP (=262) for control purposes, but it
isn't directly on the wire. That is the role of the subflow (kernel)
sockets, which are classical tcp_sock structures with sk_protocol set to
IPPROTO_TCP. The only way to differentiate such sockets from plain TCP
sockets is the is_mptcp field of tcp_sock.

Such an exposure in BPF is thus required to be able to differentiate
plain TCP sockets from MPTCP subflow sockets in BPF_PROG_TYPE_SOCK_OPS
programs.
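
The distinction described above can be written as a small decision
function. This is only an illustrative sketch; classify_sock() and the
sock_kind enum are hypothetical names, not kernel or BPF APIs:

```c
#include <assert.h>

#define PROTO_TCP	6	/* value of IPPROTO_TCP */
#define PROTO_MPTCP	262	/* value of IPPROTO_MPTCP, as given above */

enum sock_kind {
	KIND_OTHER,
	KIND_PLAIN_TCP,
	KIND_MPTCP_SUBFLOW,
	KIND_MPTCP_SOCK,
};

/* sk_protocol alone cannot tell a subflow from plain TCP; is_mptcp can. */
enum sock_kind classify_sock(int sk_protocol, int is_mptcp)
{
	if (sk_protocol == PROTO_MPTCP)
		return KIND_MPTCP_SOCK;
	if (sk_protocol == PROTO_TCP)
		return is_mptcp ? KIND_MPTCP_SUBFLOW : KIND_PLAIN_TCP;

	return KIND_OTHER;
}
```

Both subflows and plain TCP sockets take the PROTO_TCP branch, which is
why the extra flag has to be visible to the BPF program.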

The choice has been made to silently handle the case where CONFIG_MPTCP
is unset by defaulting is_mptcp to 0, in order to make BPF independent
of the MPTCP configuration. Another solution would be to make the
verifier fail in 'bpf_tcp_sock_is_valid_access', but this would add an
additional '#ifdef CONFIG_MPTCP' to the BPF code, and the same injected
BPF program would not run if MPTCP is not set.
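
The fallback can be pictured with a plain C analogue. In this sketch
DEMO_CONFIG_MPTCP is a hypothetical stand-in for CONFIG_MPTCP, and the
struct assumes the flag only exists when the option is set:

```c
#include <assert.h>

/* Hypothetical stand-in for CONFIG_MPTCP; left undefined on purpose. */
/* #define DEMO_CONFIG_MPTCP 1 */

struct demo_tcp_sock {
	unsigned int retransmits;
#ifdef DEMO_CONFIG_MPTCP
	unsigned int is_mptcp;	/* only present when the option is set */
#endif
};

/* Readers always get a value; without the option it is a constant 0. */
unsigned int demo_read_is_mptcp(const struct demo_tcp_sock *tp)
{
#ifdef DEMO_CONFIG_MPTCP
	return tp->is_mptcp;
#else
	(void)tp;	/* no such field: report "not MPTCP" */
	return 0;
#endif
}
```

The caller sees the same interface in both configurations, which is the
property that lets one BPF object load on kernels built either way.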

An example use-case is provided in
https://github.com/multipath-tcp/mptcp_net-next/tree/scripts/bpf/examples

Suggested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/uapi/linux/bpf.h       | 1 +
 net/core/filter.c              | 9 ++++++++-
 tools/include/uapi/linux/bpf.h | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d14b10b85e51..9ef1f3e1c22f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5694,6 +5694,7 @@ struct bpf_tcp_sock {
 	__u32 delivered;	/* Total data packets delivered incl. rexmits */
 	__u32 delivered_ce;	/* Like the above but only ECE marked packets */
 	__u32 icsk_retransmits;	/* Number of unrecovered [RTO] timeouts */
+	__u32 is_mptcp;		/* Is MPTCP subflow? */
 };
 
 struct bpf_sock_tuple {
diff --git a/net/core/filter.c b/net/core/filter.c
index 143f442a9505..7b1867f1f422 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -6754,7 +6754,7 @@ bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
 				  struct bpf_insn_access_aux *info)
 {
 	if (off < 0 || off >= offsetofend(struct bpf_tcp_sock,
-					  icsk_retransmits))
+					  is_mptcp))
 		return false;
 
 	if (off % size != 0)
@@ -6888,6 +6888,13 @@ u32 bpf_tcp_sock_convert_ctx_access(enum bpf_access_type type,
 	case offsetof(struct bpf_tcp_sock, icsk_retransmits):
 		BPF_INET_SOCK_GET_COMMON(icsk_retransmits);
 		break;
+	case offsetof(struct bpf_tcp_sock, is_mptcp):
+#ifdef CONFIG_MPTCP
+		BPF_TCP_SOCK_GET_COMMON(is_mptcp);
+#else
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d14b10b85e51..9ef1f3e1c22f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5694,6 +5694,7 @@ struct bpf_tcp_sock {
 	__u32 delivered;	/* Total data packets delivered incl. rexmits */
 	__u32 delivered_ce;	/* Like the above but only ECE marked packets */
 	__u32 icsk_retransmits;	/* Number of unrecovered [RTO] timeouts */
+	__u32 is_mptcp;		/* Is MPTCP subflow? */
 };
 
 struct bpf_sock_tuple {
-- 
2.36.0


^ permalink raw reply related

* [PATCH bpf-next 0/7] bpf: mptcp: Support for mptcp_sock and is_mptcp
From: Mat Martineau @ 2022-04-20 22:24 UTC (permalink / raw)
  To: netdev, bpf; +Cc: Mat Martineau, ast, daniel, andrii, mptcp

Hello BPF maintainers -

I'm one of the MPTCP subsystem maintainers. We have an MPTCP mailing list
and git repo, and have so far been upstreaming all of our commits
through the net-next and net trees. This is our first patch set for
bpf-next.

Our larger BPF-related project right now is implementing BPF-based
packet scheduling for MPTCP. One MPTCP connection may aggregate multiple
TCP "subflows", and the packet scheduler chooses which of those subflows
to use for each outgoing packet. So far we have been focusing on
BPF-based TCP congestion control code as a template.

This patch set adds BPF access to the is_mptcp flag in tcp_sock and
access to mptcp_sock structures, along with associated self tests. You
may recognize some of the code from an earlier posting
(https://lore.kernel.org/bpf/20200918121046.190240-6-nicolas.rybowski@tessares.net/)
but it has been reworked quite a bit.

Our current plan for MPTCP-related BPF patches is to continue doing
initial review on the MPTCP mailing list, and then upstream those
changes through the mailing lists to the bpf-next or bpf trees as
appropriate. This has worked well for net-next and net so far, but if
you'd prefer to handle MPTCP/BPF changes differently we can discuss
alternatives of course!

Thanks,

Mat

Geliang Tang (5):
  bpf: add bpf_skc_to_mptcp_sock_proto
  selftests: bpf: test bpf_skc_to_mptcp_sock
  selftests: bpf: verify token of struct mptcp_sock
  selftests: bpf: verify ca_name of struct mptcp_sock
  selftests: bpf: verify first of struct mptcp_sock

Nicolas Rybowski (2):
  bpf: expose is_mptcp flag to bpf_tcp_sock
  selftests: bpf: add MPTCP test base

 MAINTAINERS                                   |   2 +
 include/linux/btf_ids.h                       |   3 +-
 include/net/mptcp.h                           |   6 +
 include/uapi/linux/bpf.h                      |   8 +
 net/core/filter.c                             |  26 +-
 net/mptcp/Makefile                            |   4 +
 net/mptcp/bpf.c                               |  22 ++
 scripts/bpf_doc.py                            |   2 +
 tools/include/uapi/linux/bpf.h                |   8 +
 .../testing/selftests/bpf/bpf_mptcp_helpers.h |  17 ++
 tools/testing/selftests/bpf/bpf_tcp_helpers.h |   4 +
 tools/testing/selftests/bpf/config            |   1 +
 tools/testing/selftests/bpf/network_helpers.c |  43 ++-
 tools/testing/selftests/bpf/network_helpers.h |   4 +
 .../testing/selftests/bpf/prog_tests/mptcp.c  | 258 ++++++++++++++++++
 .../testing/selftests/bpf/progs/mptcp_sock.c  |  76 ++++++
 16 files changed, 474 insertions(+), 10 deletions(-)
 create mode 100644 net/mptcp/bpf.c
 create mode 100644 tools/testing/selftests/bpf/bpf_mptcp_helpers.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/mptcp.c
 create mode 100644 tools/testing/selftests/bpf/progs/mptcp_sock.c

base-commit: c7655df434de1dab1af1b1ba2aad757b15e25b83
-- 
2.36.0

^ permalink raw reply

* Re: [PATCH bpf-next] samples/bpf: reduce the sampling interval in xdp1_user
From: patchwork-bot+netdevbpf @ 2022-04-20 22:10 UTC (permalink / raw)
  To: Zhengchao Shao
  Cc: bpf, netdev, linux-kernel, ast, daniel, davem, kuba, hawk,
	john.fastabend, andrii, kafai, songliubraving, yhs, kpsingh,
	weiyongjun1, yuehaibing
In-Reply-To: <20220419114746.291613-1-shaozhengchao@huawei.com>

Hello:

This patch was applied to bpf/bpf-next.git (master)
by Andrii Nakryiko <andrii@kernel.org>:

On Tue, 19 Apr 2022 19:47:46 +0800 you wrote:
> If interval is 2, and sum - prev[key] = 1, the result = 0. This will
> mislead the tester that the port has no traffic right now. So reduce the
> sampling interval to 1.
> 
> Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
> ---
>  samples/bpf/xdp1_user.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Here is the summary with links:
  - [bpf-next] samples/bpf: reduce the sampling interval in xdp1_user
    https://git.kernel.org/bpf/bpf-next/c/db69264f983a

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply
