From: Greg KH <gregkh@linuxfoundation.org>
To: Antoine Tenart <atenart@kernel.org>
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com,
edumazet@google.com, netdev@vger.kernel.org, mhocko@suse.com,
stephen@networkplumber.org
Subject: Re: [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes
Date: Wed, 18 Oct 2023 18:49:18 +0200 [thread overview]
Message-ID: <2023101840-scabbed-visitor-3fdd@gregkh> (raw)
In-Reply-To: <20231018154804.420823-2-atenart@kernel.org>
On Wed, Oct 18, 2023 at 05:47:43PM +0200, Antoine Tenart wrote:
> We have an ABBA deadlock between net device unregistration and sysfs
> files being accessed[1][2]. To prevent this from happening all paths
> taking the rtnl lock after the sysfs one (actually kn->active refcount)
> use rtnl_trylock and return early (using restart_syscall)[3] which can
> make syscalls to spin for a long time when there is contention on the
> rtnl lock[4].
>
> There are not many possibilities to improve the above:
> - Rework the entire net/ locking logic.
> - Invert two locks in one of the paths — not possible.
>
> But here it's actually possible to drop one of the locks safely: the
> kernfs_node refcount. More details in the code itself, which comes with
> lots of comments.
>
> [1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/
> [2] https://lore.kernel.org/netdev/m14oyhis31.fsf@fess.ebiederm.org/
> [3] https://lore.kernel.org/netdev/20090226084924.16cb3e08@nehalam/
> [4] https://lore.kernel.org/all/20210928125500.167943-1-atenart@kernel.org/T/
>
> Signed-off-by: Antoine Tenart <atenart@kernel.org>
> ---
> net/core/net-sysfs.c | 142 ++++++++++++++++++++++++++++++++++---------
> 1 file changed, 112 insertions(+), 30 deletions(-)
>
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index fccaa5bac0ed..76d8729340b7 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -40,6 +40,88 @@ static inline int dev_isalive(const struct net_device *dev)
> return dev->reg_state <= NETREG_REGISTERED;
> }
>
> +/* We have a possible ABBA deadlock between rtnl_lock and kernfs_node->active,
> + * when unregistering a net device and accessing associated sysfs files. Tx/Rx
> + * queues do embed their own kobject for their sysfs files but the same issue
> + * applies as a net device being unresgistered will remove those sysfs files as
> + * well. The potential deadlock is as follow:
> + *
> + * CPU 0 CPU 1
> + *
> + * rtnl_lock vfs_read
> + * unregister_netdevice_many kernfs_seq_start
> + * device_del / kobject_put kernfs_get_active (kn->active++)
> + * kernfs_drain sysfs_kf_seq_show
> + * wait_event( rtnl_lock
> + * kn->active == KN_DEACTIVATED_BIAS) -> waits on CPU 0 to release
> + * -> waits on CPU 1 to decrease kn->active the rtnl lock.
> + *
> + * The historical fix was to use rtnl_trylock with restart_syscall to bail out
> + * of sysfs accesses when the lock wasn't taken. This fixed the above issue as
> + * it allowed CPU 1 to bail out of the ABBA situation.
> + *
> + * But this comes with performances issues, as syscalls are being restarted in
> + * loops when there is contention, with huge slow downs in specific scenarios
> + * (e.g. lots of virtual interfaces created at startup and userspace daemons
> + * querying their attributes).
> + *
> + * The idea below is to bail out of the active kernfs_node protection
> + * (kn->active) while still being in a safe position to continue the execution
> + * of our sysfs helpers.
> + */
> +static inline struct kernfs_node *sysfs_rtnl_lock(struct kobject *kobj,
> + struct attribute *attr,
> + struct net_device *ndev)
> +{
> + struct kernfs_node *kn;
> +
> + /* First, we hold a reference to the net device we might use in the
> + * locking section as the unregistration path might run in parallel.
> + * This will ensure the net device won't be freed before we return.
> + */
> + dev_hold(ndev);
> + /* sysfs_break_active_protection was introduced to allow self-removal of
> + * devices and their associated sysfs files by bailing out of the
> + * sysfs/kernfs protection. We do this here to allow the unregistration
> + * path to complete in parallel. The following takes a reference on the
> + * kobject and the kernfs_node being accessed.
> + *
> + * This works because we hold a reference onto the net device and the
> + * unregistration path will wait for us eventually in netdev_run_todo
> + * (outside an rtnl lock section).
> + */
> + kn = sysfs_break_active_protection(kobj, attr);
> + WARN_ON_ONCE(!kn);
If this triggers, you will end up rebooting the machines that set
panic-on-warn, do you mean to do that? And note, the huge majority of
Linux systems in the world have that enabled, so be careful.
thanks,
greg k-h
next prev parent reply other threads:[~2023-10-18 16:49 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-18 15:47 [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Antoine Tenart
2023-10-18 15:47 ` [RFC PATCH net-next 1/4] net-sysfs: remove rtnl_trylock from device attributes Antoine Tenart
2023-10-18 16:49 ` Greg KH [this message]
2023-10-19 8:13 ` Antoine Tenart
2023-10-19 15:37 ` Greg KH
2023-10-18 18:13 ` Stephen Hemminger
2023-10-19 7:48 ` Antoine Tenart
2025-01-02 22:36 ` Jakub Kicinski
2025-01-03 8:41 ` Eric Dumazet
2025-01-07 16:30 ` Antoine Tenart
2025-01-07 17:06 ` Jakub Kicinski
2025-01-16 13:36 ` Antoine Tenart
2025-01-16 23:42 ` Jakub Kicinski
2025-01-16 23:43 ` Jakub Kicinski
2025-01-17 8:26 ` Antoine Tenart
2023-10-18 15:47 ` [RFC PATCH net-next 2/4] net-sysfs: move queue attribute groups outside the default groups Antoine Tenart
2023-10-18 15:47 ` [RFC PATCH net-next 3/4] net-sysfs: prevent uncleared queues from being re-added Antoine Tenart
2023-10-18 15:47 ` [RFC PATCH net-next 4/4] net-sysfs: remove rtnl_trylock from queue attributes Antoine Tenart
2023-10-18 15:57 ` [RFC PATCH net-next 0/4] net-sysfs: remove rtnl_trylock/restart_syscall use Stephen Hemminger
2023-10-18 16:34 ` Antoine Tenart
2023-10-18 18:15 ` Stephen Hemminger
2023-10-19 7:47 ` Antoine Tenart
2023-10-19 14:54 ` Stephen Hemminger
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2023101840-scabbed-visitor-3fdd@gregkh \
--to=gregkh@linuxfoundation.org \
--cc=atenart@kernel.org \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=kuba@kernel.org \
--cc=mhocko@suse.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=stephen@networkplumber.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.