All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sabrina Dubroca <sd@queasysnail.net>
To: Antonio Quartulli <antonio@openvpn.net>
Cc: netdev@vger.kernel.org, Eric Dumazet <edumazet@google.com>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Donald Hunter <donald.hunter@gmail.com>,
	Shuah Khan <shuah@kernel.org>,
	ryazanov.s.a@gmail.com, Andrew Lunn <andrew+netdev@lunn.ch>,
	Simon Horman <horms@kernel.org>,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Xiao Liang <shaw.leon@gmail.com>,
	steffen.klassert@secunet.com, antony.antony@secunet.com,
	willemdebruijn.kernel@gmail.com, David Ahern <dsahern@kernel.org>,
	Andrew Lunn <andrew@lunn.ch>,
	Shuah Khan <skhan@linuxfoundation.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH net-next v19 00/26] Introducing OpenVPN Data Channel Offload
Date: Thu, 13 Feb 2025 16:46:11 +0100	[thread overview]
Message-ID: <Z64Tw02PO433bob8@hog> (raw)
In-Reply-To: <090524ac-724d-4915-8699-fe2ae736ab8c@openvpn.net>

2025-02-13, 12:46:34 +0100, Antonio Quartulli wrote:
> On 13/02/2025 00:34, Sabrina Dubroca wrote:
> > Hello,
> > 
> > 2025-02-11, 01:39:53 +0100, Antonio Quartulli wrote:
> > > All minor and major reported problems have been finally addressed.
> > > Big thanks to Sabrina, who took the time to guide me through
> > > converting the peer socket to an RCU pointer.
> > 
> > Something is off (not sure if it's new to this version): if I use
> > test-tcp.sh to setup a set of interfaces and peers (I stop the test
> > just after setup to keep the environment alive), then remove all netns
> > with "ip -all netns delete", I expect all devices to go away, but they
> > don't. With debug messages enabled I'm seeing some activity from the
> > module ("tun0: sending keepalive to peer 3" and so on), and
> > ovpn_net_uninit/ovpn_priv_free never got called.
> 
> I can reproduce it. If later I rmmod ovpn I then get all the "Deleting peer"
> messages.
> So instances are not being purged on netns exit.
> 
> Will dive into it.

I think the socket holds a ref on the netns, so it's not getting
destroyed, simply "removed" from iproute's point of view. And the
socket isn't going away as long as it's used by a peer.

If I delete the peer(s) for the ovpn device and then the netns it was
in, the netns is fully removed, and the ovpn device is gone. Also no
issue if I delete the ovpn device before its netns, then everything is
destroyed as expected.

I'm not sure that can be solved, as least under the current refcount
scheme.

But I don't think there's a way to re-attach to that netns afterwards
if we wanted to clean up manually (something similar to "ip netns
attach <name> <pid>", but that won't work if whatever created the
socket is not running anymore -- as is the case with ovpn-cli).

> > 
> > [...]
> > > So there is NO risk of deadlock (and indeed nothing hangs), but I
> > > couldn't find a way to make the warning go away.
> > 
> > I've spotted another splat on strparser cleanup that looked like an
> > actual deadlock, but it's not very reproducible. Still looking into
> > it, but I'm not convinced it's ok to call strp_done (as is done from
> > ovpn_tcp_socket_detach) while under lock_sock, because AFAIU
> > cancel_work_sync(&strp->work) may be waiting for a work that needs to
> > lock the socket (cb.lock in do_strp_work). I guess tcp_tx_work would
> > have the same problem.
> 
> Will have a look here too.

The only ways I've managed to reproduce it is by using some ugly
kernel-side hacks to try to force that path being hit. Either forcing
the strp work to be queued just as we detach the socket (in
strp_stop), or this:

 - hack some code path to do a big sleep() under lock_sock(), to give
   me the time to do the next steps
 - ping over ovpn, or any other way to add packets on the receive
   socket
 - delete the peer for the sleeping socket (while it's still sleeping)

When that big sleep is over, strp_data_ready will kick off and queue
its worker (because at this point the socket lock is still owned),
then I think del_peer proceeds toward tcp_detach and I got the splat
below. So that's maybe a bit hard to trigger in real life (I don't
remember what I was doing the first time I got it, I think something
messing with the RCU stuff we discussed earlier).


[  922.681435][  T300] ======================================================
[  922.686247][  T300] WARNING: possible circular locking dependency detected
[  922.690971][  T300] 6.14.0-rc1-net-00276-g95f1f7ea224e-dirty #279 Tainted: G                 N
[  922.696584][  T300] ------------------------------------------------------
[  922.699697][  T300] kworker/1:2/300 is trying to acquire lock:
[  922.702105][  T300] ffff88800a662160 ((work_completion)(&strp->work)){+.+.}-{0:0}, at: start_flush_work+0x407/0xa50
[  922.705716][  T300] 
[  922.705716][  T300] but task is already holding lock:
[  922.707779][  T300] ffff8880113edd98 (sk_lock-AF_INET){+.+.}-{0:0}, at: ovpn_socket_release+0x8a/0x1a0
[  922.710238][  T300] 
[  922.710238][  T300] which lock already depends on the new lock.
[  922.710238][  T300] 
[  922.712628][  T300] 
[  922.712628][  T300] the existing dependency chain (in reverse order) is:
[  922.714443][  T300] 
[  922.714443][  T300] -> #1 (sk_lock-AF_INET){+.+.}-{0:0}:
[  922.716127][  T300]        __lock_acquire+0xc4d/0x1ee0
[  922.717250][  T300]        lock_acquire+0x1a9/0x500
[  922.718266][  T300]        lock_sock_nested+0x40/0xf0
[  922.719325][  T300]        strp_work+0x95/0x1e0
[  922.720240][  T300]        process_one_work+0xe28/0x1460
[  922.721307][  T300]        worker_thread+0x674/0xee0
[  922.722283][  T300]        kthread+0x3c3/0x760
[  922.723101][  T300]        ret_from_fork+0x46/0x80
[  922.723792][  T300]        ret_from_fork_asm+0x1a/0x30
[  922.724532][  T300] 
[  922.724532][  T300] -> #0 ((work_completion)(&strp->work)){+.+.}-{0:0}:
[  922.726038][  T300]        check_prev_add+0x1af/0x2400
[  922.726927][  T300]        validate_chain+0xdcf/0x1a10
[  922.727847][  T300]        __lock_acquire+0xc4d/0x1ee0
[  922.728721][  T300]        lock_acquire+0x1a9/0x500
[  922.729590][  T300]        start_flush_work+0x41a/0xa50
[  922.730434][  T300]        __flush_work+0xee/0x210
[  922.731213][  T300]        cancel_work_sync+0xb8/0xd0
[  922.732022][  T300]        strp_done.cold+0x51/0xcf
[  922.732830][  T300]        ovpn_tcp_socket_detach+0x28e/0x2de
[  922.733752][  T300]        ovpn_socket_release_kref+0x1ef/0x350
[  922.734713][  T300]        ovpn_socket_release+0xe7/0x1a0
[  922.735577][  T300]        ovpn_peer_remove_work+0x2b/0x90
[  922.736468][  T300]        process_one_work+0xe28/0x1460
[  922.737357][  T300]        worker_thread+0x674/0xee0
[  922.737985][  T300]        kthread+0x3c3/0x760
[  922.738545][  T300]        ret_from_fork+0x46/0x80
[  922.739177][  T300]        ret_from_fork_asm+0x1a/0x30
[  922.739846][  T300] 
[  922.739846][  T300] other info that might help us debug this:
[  922.739846][  T300] 
[  922.741159][  T300]  Possible unsafe locking scenario:
[  922.741159][  T300] 
[  922.742165][  T300]        CPU0                    CPU1
[  922.743055][  T300]        ----                    ----
[  922.743943][  T300]   lock(sk_lock-AF_INET);
[  922.744638][  T300]                                lock((work_completion)(&strp->work));
[  922.745786][  T300]                                lock(sk_lock-AF_INET);
[  922.746837][  T300]   lock((work_completion)(&strp->work));
[  922.747788][  T300] 
[  922.747788][  T300]  *** DEADLOCK ***
[  922.747788][  T300] 
[  922.748856][  T300] 4 locks held by kworker/1:2/300:
[  922.749474][  T300]  #0: ffff888008499b48 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0xd80/0x1460
[  922.750588][  T300]  #1: ffffc90002527d30 ((work_completion)(&peer->remove_work)){+.+.}-{0:0}, at: process_one_work+0xddf/0x1460
[  922.751894][  T300]  #2: ffff8880113edd98 (sk_lock-AF_INET){+.+.}-{0:0}, at: ovpn_socket_release+0x8a/0x1a0
[  922.752983][  T300]  #3: ffffffff854de980 (rcu_read_lock){....}-{1:3}, at: start_flush_work+0x39/0xa50
[  922.754018][  T300] 
[  922.754018][  T300] stack backtrace:
[  922.754734][  T300] CPU: 1 UID: 0 PID: 300 Comm: kworker/1:2 Tainted: G                 N 6.14.0-rc1-net-00276-g95f1f7ea224e-dirty #279
[  922.754748][  T300] Tainted: [N]=TEST
[  922.754752][  T300] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[  922.754761][  T300] Workqueue: events ovpn_peer_remove_work
[  922.754779][  T300] Call Trace:
[  922.754785][  T300]  <TASK>
[  922.754791][  T300]  dump_stack_lvl+0xa5/0x100
[  922.754803][  T300]  print_circular_bug.cold+0x38/0x48
[  922.754820][  T300]  check_noncircular+0x2f2/0x3d0
[  922.754922][  T300]  check_prev_add+0x1af/0x2400
[  922.754942][  T300]  validate_chain+0xdcf/0x1a10
[  922.754991][  T300]  __lock_acquire+0xc4d/0x1ee0
[  922.755023][  T300]  lock_acquire+0x1a9/0x500
[  922.755104][  T300]  start_flush_work+0x41a/0xa50
[  922.755128][  T300]  __flush_work+0xee/0x210
[  922.755198][  T300]  cancel_work_sync+0xb8/0xd0
[  922.755211][  T300]  strp_done.cold+0x51/0xcf
[  922.755222][  T300]  ovpn_tcp_socket_detach+0x28e/0x2de
[  922.755237][  T300]  ovpn_socket_release_kref+0x1ef/0x350
[  922.755253][  T300]  ovpn_socket_release+0xe7/0x1a0
[  922.755268][  T300]  ovpn_peer_remove_work+0x2b/0x90
[  922.755282][  T300]  process_one_work+0xe28/0x1460
[  922.755330][  T300]  worker_thread+0x674/0xee0
[  922.755402][  T300]  kthread+0x3c3/0x760
[  922.755472][  T300]  ret_from_fork+0x46/0x80
[  922.755497][  T300]  ret_from_fork_asm+0x1a/0x30
[  922.755521][  T300]  </TASK>


And then come the hung task warnings:
 - del_peer waiting on the completion
 - strp waiting on lock_sock
 - peer cleanup/socket detach waiting on cancel_work_sync/strp

[ 1106.347400][   T39] INFO: task kworker/u16:0:11 blocked for more than 122 seconds.
[ 1106.348547][   T39]       Tainted: G                 N 6.14.0-rc1-net-00276-g95f1f7ea224e-dirty #279
[ 1106.349671][   T39] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1106.350748][   T39] task:kworker/u16:0   state:D stack:0     pid:11    tgid:11    ppid:2      task_flags:0x4208160 flags:0x00004000
[ 1106.352454][   T39] Workqueue: kstrp strp_work
[ 1106.353133][   T39] Call Trace:
[ 1106.353659][   T39]  <TASK>
[ 1106.354143][   T39]  __schedule+0xace/0x2620
[ 1106.360659][   T39]  schedule+0xd0/0x210
[ 1106.361266][   T39]  __lock_sock+0x137/0x230
[ 1106.365430][   T39]  lock_sock_nested+0xcb/0xf0
[ 1106.366173][   T39]  strp_work+0x95/0x1e0
[ 1106.366845][   T39]  process_one_work+0xe28/0x1460
[ 1106.369231][   T39]  worker_thread+0x674/0xee0
[ 1106.371594][   T39]  kthread+0x3c3/0x760
[ 1106.375692][   T39]  ret_from_fork+0x46/0x80
[ 1106.377040][   T39]  ret_from_fork_asm+0x1a/0x30
[ 1106.377851][   T39]  </TASK>
[ 1106.378355][   T39] INFO: task kworker/1:2:300 blocked for more than 122 seconds.
[ 1106.379590][   T39]       Tainted: G                 N 6.14.0-rc1-net-00276-g95f1f7ea224e-dirty #279
[ 1106.381079][   T39] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1106.382372][   T39] task:kworker/1:2     state:D stack:0     pid:300   tgid:300   ppid:2      task_flags:0x4208060 flags:0x00004000
[ 1106.384211][   T39] Workqueue: events ovpn_peer_remove_work
[ 1106.385072][   T39] Call Trace:
[ 1106.385648][   T39]  <TASK>
[ 1106.386139][   T39]  __schedule+0xace/0x2620
[ 1106.393518][   T39]  schedule+0xd0/0x210
[ 1106.394195][   T39]  schedule_timeout+0x18c/0x240
[ 1106.398840][   T39]  __wait_for_common+0x3e3/0x610
[ 1106.403079][   T39]  __flush_work+0x14f/0x210
[ 1106.406874][   T39]  cancel_work_sync+0xb8/0xd0
[ 1106.407519][   T39]  strp_done.cold+0x51/0xcf
[ 1106.408109][   T39]  ovpn_tcp_socket_detach+0x28e/0x2de
[ 1106.408851][   T39]  ovpn_socket_release_kref+0x1ef/0x350
[ 1106.409588][   T39]  ovpn_socket_release+0xe7/0x1a0
[ 1106.410271][   T39]  ovpn_peer_remove_work+0x2b/0x90
[ 1106.410973][   T39]  process_one_work+0xe28/0x1460
[ 1106.413075][   T39]  worker_thread+0x674/0xee0
[ 1106.416968][   T39]  kthread+0x3c3/0x760
[ 1106.419983][   T39]  ret_from_fork+0x46/0x80
[ 1106.421172][   T39]  ret_from_fork_asm+0x1a/0x30
[ 1106.421829][   T39]  </TASK>
[ 1106.422260][   T39] INFO: task ovpn-cli:1213 blocked for more than 122 seconds.
[ 1106.423215][   T39]       Tainted: G                 N 6.14.0-rc1-net-00276-g95f1f7ea224e-dirty #279
[ 1106.424451][   T39] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1106.425643][   T39] task:ovpn-cli        state:D stack:0     pid:1213  tgid:1213  ppid:514    task_flags:0x400100 flags:0x00000002
[ 1106.427145][   T39] Call Trace:
[ 1106.427610][   T39]  <TASK>
[ 1106.428005][   T39]  __schedule+0xace/0x2620
[ 1106.431820][   T39]  schedule+0xd0/0x210
[ 1106.432351][   T39]  schedule_timeout+0x18c/0x240
[ 1106.435050][   T39]  __wait_for_common+0x3e3/0x610
[ 1106.439809][   T39]  ovpn_nl_peer_del_doit+0x270/0x6e0
[ 1106.442825][   T39]  genl_family_rcv_msg_doit+0x1ea/0x2e0
[ 1106.445602][   T39]  genl_family_rcv_msg+0x3a7/0x5b0
[ 1106.450559][   T39]  genl_rcv_msg+0xb1/0x160
[ 1106.451154][   T39]  netlink_rcv_skb+0x13e/0x3d0
[ 1106.455220][   T39]  genl_rcv+0x29/0x40
[ 1106.455737][   T39]  netlink_unicast+0x491/0x730
[ 1106.457107][   T39]  netlink_sendmsg+0x77d/0xc00
[ 1106.458517][   T39]  ____sys_sendmsg+0x7c5/0xac0
[ 1106.461329][   T39]  ___sys_sendmsg+0x163/0x1b0
[ 1106.468146][   T39]  __sys_sendmsg+0x135/0x1d0
[ 1106.471564][   T39]  do_syscall_64+0x64/0x140
[ 1106.472173][   T39]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1106.472947][   T39] RIP: 0033:0x7faa76628e56
[ 1106.473558][   T39] RSP: 002b:00007ffe757311e0 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[ 1106.474674][   T39] RAX: ffffffffffffffda RBX: 00007faa7654f740 RCX: 00007faa76628e56
[ 1106.475715][   T39] RDX: 0000000000000000 RSI: 00007ffe75731270 RDI: 0000000000000003
[ 1106.476847][   T39] RBP: 00007ffe757311f0 R08: 0000000000000000 R09: 0000000000000000
[ 1106.477959][   T39] R10: 0000000000000000 R11: 0000000000000202 R12: 00005617490cf490
[ 1106.479064][   T39] R13: 00005617490d04f0 R14: 00007ffe75731270 R15: 0000561733d62d78
[ 1106.480175][   T39]  </TASK>
[ 1106.480646][   T39] INFO: lockdep is turned off.

-- 
Sabrina

  reply	other threads:[~2025-02-13 15:46 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-11  0:39 [PATCH net-next v19 00/26] Introducing OpenVPN Data Channel Offload Antonio Quartulli
2025-02-11  0:39 ` [PATCH net-next v19 01/26] net: introduce OpenVPN Data Channel Offload (ovpn) Antonio Quartulli
2025-02-11  0:39 ` [PATCH net-next v19 02/26] ovpn: add basic netlink support Antonio Quartulli
2025-02-11  0:39 ` [PATCH net-next v19 03/26] ovpn: add basic interface creation/destruction/management routines Antonio Quartulli
2025-02-11  0:39 ` [PATCH net-next v19 04/26] ovpn: keep carrier always on for MP interfaces Antonio Quartulli
2025-02-11  0:39 ` [PATCH net-next v19 05/26] ovpn: introduce the ovpn_peer object Antonio Quartulli
2025-02-11  0:39 ` [PATCH net-next v19 06/26] ovpn: introduce the ovpn_socket object Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 07/26] ovpn: implement basic TX path (UDP) Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 08/26] ovpn: implement basic RX " Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 09/26] ovpn: implement packet processing Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 10/26] ovpn: store tunnel and transport statistics Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 11/26] ipv6: export inet6_stream_ops via EXPORT_SYMBOL_GPL Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 12/26] ovpn: implement TCP transport Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 13/26] skb: implement skb_send_sock_locked_with_flags() Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 14/26] ovpn: add support for MSG_NOSIGNAL in tcp_sendmsg Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 15/26] ovpn: implement multi-peer support Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 16/26] ovpn: implement peer lookup logic Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 17/26] ovpn: implement keepalive mechanism Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 18/26] ovpn: add support for updating local UDP endpoint Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 19/26] ovpn: add support for peer floating Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 20/26] ovpn: implement peer add/get/dump/delete via netlink Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 21/26] ovpn: implement key add/get/del/swap " Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 22/26] ovpn: kill key and notify userspace in case of IV exhaustion Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 23/26] ovpn: notify userspace when a peer is deleted Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 24/26] ovpn: add basic ethtool support Antonio Quartulli
2025-02-11  0:40 ` [PATCH net-next v19 25/26] testing/selftests: add test tool and scripts for ovpn module Antonio Quartulli
2025-02-13  0:36   ` kernel test robot
2025-02-14  1:49   ` kernel test robot
2025-02-11  0:40 ` [PATCH net-next v19 26/26] mailmap: remove unwanted entry for Antonio Quartulli Antonio Quartulli
2025-02-12 23:34 ` [PATCH net-next v19 00/26] Introducing OpenVPN Data Channel Offload Sabrina Dubroca
2025-02-13 11:46   ` Antonio Quartulli
2025-02-13 15:46     ` Sabrina Dubroca [this message]
2025-02-13 19:40       ` Antonio Quartulli
2025-02-14 13:54         ` Antonio Quartulli
2025-02-14 14:00           ` Antonio Quartulli
2025-02-13 14:26 ` Sean Anderson
2025-02-13 19:46   ` Antonio Quartulli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z64Tw02PO433bob8@hog \
    --to=sd@queasysnail.net \
    --cc=akpm@linux-foundation.org \
    --cc=andrew+netdev@lunn.ch \
    --cc=andrew@lunn.ch \
    --cc=antonio@openvpn.net \
    --cc=antony.antony@secunet.com \
    --cc=donald.hunter@gmail.com \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=ryazanov.s.a@gmail.com \
    --cc=shaw.leon@gmail.com \
    --cc=shuah@kernel.org \
    --cc=skhan@linuxfoundation.org \
    --cc=steffen.klassert@secunet.com \
    --cc=willemdebruijn.kernel@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.