Re: [PATCH net V5 3/3] bonding: change ipsec_lock from spin lock to mutex

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jay Vosburgh <jv@jvosburgh.net>
To: Jianbo Liu <jianbol@nvidia.com>
Cc: netdev@vger.kernel.org, davem@davemloft.net, kuba@kernel.org,
	pabeni@redhat.com, edumazet@google.com, andy@greyhouse.net,
	saeedm@nvidia.com, gal@nvidia.com, leonro@nvidia.com,
	liuhangbin@gmail.com, tariqt@nvidia.com
Subject: Re: [PATCH net V5 3/3] bonding: change ipsec_lock from spin lock to mutex
Date: Wed, 21 Aug 2024 09:00:30 -0700	[thread overview]
Message-ID: <120654.1724256030@famine> (raw)
In-Reply-To: <20240821090458.10813-4-jianbol@nvidia.com>

Jianbo Liu <jianbol@nvidia.com> wrote:

>In the cited commit, bond->ipsec_lock is added to protect ipsec_list,
>hence xdo_dev_state_add and xdo_dev_state_delete are called inside
>this lock. As ipsec_lock is a spin lock and such xfrmdev ops may sleep,
>"scheduling while atomic" will be triggered when changing bond's
>active slave.
>
>[  101.055189] BUG: scheduling while atomic: bash/902/0x00000200
>[  101.055726] Modules linked in:
>[  101.058211] CPU: 3 PID: 902 Comm: bash Not tainted 6.9.0-rc4+ #1
>[  101.058760] Hardware name:
>[  101.059434] Call Trace:
>[  101.059436]  <TASK>
>[  101.060873]  dump_stack_lvl+0x51/0x60
>[  101.061275]  __schedule_bug+0x4e/0x60
>[  101.061682]  __schedule+0x612/0x7c0
>[  101.062078]  ? __mod_timer+0x25c/0x370
>[  101.062486]  schedule+0x25/0xd0
>[  101.062845]  schedule_timeout+0x77/0xf0
>[  101.063265]  ? asm_common_interrupt+0x22/0x40
>[  101.063724]  ? __bpf_trace_itimer_state+0x10/0x10
>[  101.064215]  __wait_for_common+0x87/0x190
>[  101.064648]  ? usleep_range_state+0x90/0x90
>[  101.065091]  cmd_exec+0x437/0xb20 [mlx5_core]
>[  101.065569]  mlx5_cmd_do+0x1e/0x40 [mlx5_core]
>[  101.066051]  mlx5_cmd_exec+0x18/0x30 [mlx5_core]
>[  101.066552]  mlx5_crypto_create_dek_key+0xea/0x120 [mlx5_core]
>[  101.067163]  ? bonding_sysfs_store_option+0x4d/0x80 [bonding]
>[  101.067738]  ? kmalloc_trace+0x4d/0x350
>[  101.068156]  mlx5_ipsec_create_sa_ctx+0x33/0x100 [mlx5_core]
>[  101.068747]  mlx5e_xfrm_add_state+0x47b/0xaa0 [mlx5_core]
>[  101.069312]  bond_change_active_slave+0x392/0x900 [bonding]
>[  101.069868]  bond_option_active_slave_set+0x1c2/0x240 [bonding]
>[  101.070454]  __bond_opt_set+0xa6/0x430 [bonding]
>[  101.070935]  __bond_opt_set_notify+0x2f/0x90 [bonding]
>[  101.071453]  bond_opt_tryset_rtnl+0x72/0xb0 [bonding]
>[  101.071965]  bonding_sysfs_store_option+0x4d/0x80 [bonding]
>[  101.072567]  kernfs_fop_write_iter+0x10c/0x1a0
>[  101.073033]  vfs_write+0x2d8/0x400
>[  101.073416]  ? alloc_fd+0x48/0x180
>[  101.073798]  ksys_write+0x5f/0xe0
>[  101.074175]  do_syscall_64+0x52/0x110
>[  101.074576]  entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
>As bond_ipsec_add_sa_all and bond_ipsec_del_sa_all are only called
>from bond_change_active_slave, which requires holding the RTNL lock.
>And bond_ipsec_add_sa and bond_ipsec_del_sa are xfrm state
>xdo_dev_state_add and xdo_dev_state_delete APIs, which are in user
>context. So ipsec_lock doesn't have to be spin lock, change it to
>mutex, and thus the above issue can be resolved.
>
>Fixes: 9a5605505d9c ("bonding: Add struct bond_ipesc to manage SA")
>Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
>Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
>Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
>---
> drivers/net/bonding/bond_main.c | 67 +++++++++++++++------------------
> include/net/bonding.h           |  2 +-
> 2 files changed, 32 insertions(+), 37 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 0d1129eaf47b..f20f6d83ad54 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -439,38 +439,33 @@ static int bond_ipsec_add_sa(struct xfrm_state *xs,
> 	rcu_read_lock();
> 	bond = netdev_priv(bond_dev);
> 	slave = rcu_dereference(bond->curr_active_slave);
>-	if (!slave) {
>-		rcu_read_unlock();
>+	real_dev = slave ? slave->dev : NULL;
>+	rcu_read_unlock();
>+	if (!real_dev)
> 		return -ENODEV;

	In reading these, I was confused as to why some changes use
rcu_read_lock(), rcu_dereference() and others use rtnl_dereference(); I
think it's because bond_ipsec_{add,del}_sa_all() are guaranteed to be
called under RTNL, while the bond_ipsec_{add,del}_sa() functions are do
not have that guarantee.  Am I understanding correctly?

>-	}
> 
>-	real_dev = slave->dev;
> 	if (!real_dev->xfrmdev_ops ||
> 	    !real_dev->xfrmdev_ops->xdo_dev_state_add ||
> 	    netif_is_bond_master(real_dev)) {
> 		NL_SET_ERR_MSG_MOD(extack, "Slave does not support ipsec offload");
>-		rcu_read_unlock();
> 		return -EINVAL;
> 	}
> 
>-	ipsec = kmalloc(sizeof(*ipsec), GFP_ATOMIC);
>-	if (!ipsec) {
>-		rcu_read_unlock();
>+	ipsec = kmalloc(sizeof(*ipsec), GFP_KERNEL);
>+	if (!ipsec)
> 		return -ENOMEM;

	Presumably the switch from ATOMIC to KERNEL is safe because this
is only called under RTNL (and therefore always has a process context),
i.e., this change is independent of any other changes in the patch.
Correct?

>-	}
> 
> 	xs->xso.real_dev = real_dev;
> 	err = real_dev->xfrmdev_ops->xdo_dev_state_add(xs, extack);
> 	if (!err) {
> 		ipsec->xs = xs;
> 		INIT_LIST_HEAD(&ipsec->list);
>-		spin_lock_bh(&bond->ipsec_lock);
>+		mutex_lock(&bond->ipsec_lock);
> 		list_add(&ipsec->list, &bond->ipsec_list);
>-		spin_unlock_bh(&bond->ipsec_lock);
>+		mutex_unlock(&bond->ipsec_lock);
> 	} else {
> 		kfree(ipsec);
> 	}
>-	rcu_read_unlock();
> 	return err;
> }
> 
>@@ -481,35 +476,35 @@ static void bond_ipsec_add_sa_all(struct bonding *bond)
> 	struct bond_ipsec *ipsec;
> 	struct slave *slave;
> 
>-	rcu_read_lock();
>-	slave = rcu_dereference(bond->curr_active_slave);
>-	if (!slave)
>-		goto out;
>+	slave = rtnl_dereference(bond->curr_active_slave);
>+	real_dev = slave ? slave->dev : NULL;
>+	if (!real_dev)
>+		return;
> 
>-	real_dev = slave->dev;
>+	mutex_lock(&bond->ipsec_lock);
> 	if (!real_dev->xfrmdev_ops ||
> 	    !real_dev->xfrmdev_ops->xdo_dev_state_add ||
> 	    netif_is_bond_master(real_dev)) {
>-		spin_lock_bh(&bond->ipsec_lock);
> 		if (!list_empty(&bond->ipsec_list))
> 			slave_warn(bond_dev, real_dev,
> 				   "%s: no slave xdo_dev_state_add\n",
> 				   __func__);
>-		spin_unlock_bh(&bond->ipsec_lock);
> 		goto out;
> 	}
> 
>-	spin_lock_bh(&bond->ipsec_lock);
> 	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
>+		/* If new state is added before ipsec_lock acquired */
>+		if (ipsec->xs->xso.real_dev == real_dev)
>+			continue;
>+
> 		ipsec->xs->xso.real_dev = real_dev;
> 		if (real_dev->xfrmdev_ops->xdo_dev_state_add(ipsec->xs, NULL)) {
> 			slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__);
> 			ipsec->xs->xso.real_dev = NULL;
> 		}
> 	}
>-	spin_unlock_bh(&bond->ipsec_lock);
> out:
>-	rcu_read_unlock();
>+	mutex_unlock(&bond->ipsec_lock);
> }
> 
> /**
>@@ -530,6 +525,8 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs)
> 	rcu_read_lock();
> 	bond = netdev_priv(bond_dev);
> 	slave = rcu_dereference(bond->curr_active_slave);
>+	real_dev = slave ? slave->dev : NULL;
>+	rcu_read_unlock();

	Is it really safe to access real_dev once we've left the rcu
critical section?  What prevents the device referenced by real_dev from
being deleted as soon as rcu_read_unlock() completes?

	-J
	
> 
> 	if (!slave)
> 		goto out;
>@@ -537,7 +534,6 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs)
> 	if (!xs->xso.real_dev)
> 		goto out;
> 
>-	real_dev = slave->dev;
> 	WARN_ON(xs->xso.real_dev != real_dev);
> 
> 	if (!real_dev->xfrmdev_ops ||
>@@ -549,7 +545,7 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs)
> 
> 	real_dev->xfrmdev_ops->xdo_dev_state_delete(xs);
> out:
>-	spin_lock_bh(&bond->ipsec_lock);
>+	mutex_lock(&bond->ipsec_lock);
> 	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> 		if (ipsec->xs == xs) {
> 			list_del(&ipsec->list);
>@@ -557,8 +553,7 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs)
> 			break;
> 		}
> 	}
>-	spin_unlock_bh(&bond->ipsec_lock);
>-	rcu_read_unlock();
>+	mutex_unlock(&bond->ipsec_lock);
> }
> 
> static void bond_ipsec_del_sa_all(struct bonding *bond)
>@@ -568,15 +563,12 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
> 	struct bond_ipsec *ipsec;
> 	struct slave *slave;
> 
>-	rcu_read_lock();
>-	slave = rcu_dereference(bond->curr_active_slave);
>-	if (!slave) {
>-		rcu_read_unlock();
>+	slave = rtnl_dereference(bond->curr_active_slave);
>+	real_dev = slave ? slave->dev : NULL;
>+	if (!real_dev)
> 		return;
>-	}
> 
>-	real_dev = slave->dev;
>-	spin_lock_bh(&bond->ipsec_lock);
>+	mutex_lock(&bond->ipsec_lock);
> 	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> 		if (!ipsec->xs->xso.real_dev)
> 			continue;
>@@ -593,8 +585,7 @@ static void bond_ipsec_del_sa_all(struct bonding *bond)
> 				real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs);
> 		}
> 	}
>-	spin_unlock_bh(&bond->ipsec_lock);
>-	rcu_read_unlock();
>+	mutex_unlock(&bond->ipsec_lock);
> }
> 
> static void bond_ipsec_free_sa(struct xfrm_state *xs)
>@@ -5917,7 +5908,7 @@ void bond_setup(struct net_device *bond_dev)
> 	/* set up xfrm device ops (only supported in active-backup right now) */
> 	bond_dev->xfrmdev_ops = &bond_xfrmdev_ops;
> 	INIT_LIST_HEAD(&bond->ipsec_list);
>-	spin_lock_init(&bond->ipsec_lock);
>+	mutex_init(&bond->ipsec_lock);
> #endif /* CONFIG_XFRM_OFFLOAD */
> 
> 	/* don't acquire bond device's netif_tx_lock when transmitting */
>@@ -5966,6 +5957,10 @@ static void bond_uninit(struct net_device *bond_dev)
> 		__bond_release_one(bond_dev, slave->dev, true, true);
> 	netdev_info(bond_dev, "Released all slaves\n");
> 
>+#ifdef CONFIG_XFRM_OFFLOAD
>+	mutex_destroy(&bond->ipsec_lock);
>+#endif /* CONFIG_XFRM_OFFLOAD */
>+
> 	bond_set_slave_arr(bond, NULL, NULL);
> 
> 	list_del_rcu(&bond->bond_list);
>diff --git a/include/net/bonding.h b/include/net/bonding.h
>index b61fb1aa3a56..8bb5f016969f 100644
>--- a/include/net/bonding.h
>+++ b/include/net/bonding.h
>@@ -260,7 +260,7 @@ struct bonding {
> #ifdef CONFIG_XFRM_OFFLOAD
> 	struct list_head ipsec_list;
> 	/* protecting ipsec_list */
>-	spinlock_t ipsec_lock;
>+	struct mutex ipsec_lock;
> #endif /* CONFIG_XFRM_OFFLOAD */
> 	struct bpf_prog *xdp_prog;
> };
>-- 
>2.21.0
>

---
	-Jay Vosburgh, jv@jvosburgh.net

next prev parent reply	other threads:[~2024-08-21 16:00 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-21  9:04 [PATCH net V5 0/3] Fixes for IPsec over bonding Jianbo Liu
2024-08-21  9:04 ` [PATCH net V5 1/3] bonding: implement xdo_dev_state_free and call it after deletion Jianbo Liu
2024-08-21  9:04 ` [PATCH net V5 2/3] bonding: extract the use of real_device into local variable Jianbo Liu
2024-08-21  9:04 ` [PATCH net V5 3/3] bonding: change ipsec_lock from spin lock to mutex Jianbo Liu
2024-08-21 16:00   ` Jay Vosburgh [this message]
2024-08-22  0:11     ` Jakub Kicinski
2024-08-22  2:07       ` Jianbo Liu
2024-08-22  1:53     ` Jianbo Liu
2024-08-22  6:05       ` Jay Vosburgh
2024-08-22 11:15         ` Jianbo Liu
2024-08-22 23:19           ` Jakub Kicinski
2024-08-23  0:17           ` Jay Vosburgh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=120654.1724256030@famine \
    --to=jv@jvosburgh.net \
    --cc=andy@greyhouse.net \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=gal@nvidia.com \
    --cc=jianbol@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=leonro@nvidia.com \
    --cc=liuhangbin@gmail.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=saeedm@nvidia.com \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.