All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hangbin Liu <liuhangbin@gmail.com>
To: Cosmin Ratiu <cratiu@nvidia.com>
Cc: "shuah@kernel.org" <shuah@kernel.org>,
	"andrew+netdev@lunn.ch" <andrew+netdev@lunn.ch>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"jv@jvosburgh.net" <jv@jvosburgh.net>,
	"jarod@redhat.com" <jarod@redhat.com>,
	"razor@blackwall.org" <razor@blackwall.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"pabeni@redhat.com" <pabeni@redhat.com>,
	Jianbo Liu <jianbol@nvidia.com>,
	"edumazet@google.com" <edumazet@google.com>,
	"horms@kernel.org" <horms@kernel.org>,
	"kuba@kernel.org" <kuba@kernel.org>,
	Tariq Toukan <tariqt@nvidia.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"steffen.klassert@secunet.com" <steffen.klassert@secunet.com>,
	"linux-kselftest@vger.kernel.org"
	<linux-kselftest@vger.kernel.org>
Subject: Re: [PATCHv2 net 1/3] bonding: move mutex lock to a work queue for XFRM GC tasks
Date: Wed, 26 Feb 2025 12:07:31 +0000	[thread overview]
Message-ID: <Z78EA2LEuFAwufNJ@fedora> (raw)
In-Reply-To: <0605dc53cdcee5ea71b89114f2318dd5d0a83276.camel@nvidia.com>

On Wed, Feb 26, 2025 at 11:05:47AM +0000, Cosmin Ratiu wrote:
> > > What do you think about this idea?
> > 
> > Thanks a lot for the comments. I also skipped the DEAD xs in
> > add_sa_all.
> > What about the patch like:
> 
> This is what I had in mind, thanks for proposing it. Maybe you should
> package it in a new submission with a proper title/etc.?
> I'll do the initial review here.

This is a draft patch and I think there may have something need to be fixed.
So I just paste it here :)

> 
> > 
> > diff --git a/drivers/net/bonding/bond_main.c
> > b/drivers/net/bonding/bond_main.c
> > index e45bba240cbc..0e4db43a833a 100644
> > --- a/drivers/net/bonding/bond_main.c
> > +++ b/drivers/net/bonding/bond_main.c
> > @@ -537,6 +537,12 @@ static void bond_ipsec_add_sa_all(struct bonding
> > *bond)
> >  	}
> >  
> >  	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> > +		/* No need to handle DEAD XFRM, as it has already
> > been
> > +		 * deleted and will be freed later.
> > +		 */
> 
> Nit: Maybe rephrase that as "Skip dead xfrm states, they'll be freed
> later."
> 
> > +		if (ipsec->xs->km.state == XFRM_STATE_DEAD)
> > +			continue;
> > +
> >  		/* If new state is added before ipsec_lock acquired
> > */
> >  		if (ipsec->xs->xso.real_dev == real_dev)
> >  			continue;
> > @@ -592,15 +598,6 @@ static void bond_ipsec_del_sa(struct xfrm_state
> > *xs)
> >  	real_dev->xfrmdev_ops->xdo_dev_state_delete(xs);
> >  out:
> >  	netdev_put(real_dev, &tracker);
> > -	mutex_lock(&bond->ipsec_lock);
> > -	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> > -		if (ipsec->xs == xs) {
> > -			list_del(&ipsec->list);
> > -			kfree(ipsec);
> > -			break;
> > -		}
> > -	}
> > -	mutex_unlock(&bond->ipsec_lock);
> >  }
> >  
> >  static void bond_ipsec_del_sa_all(struct bonding *bond)
> > @@ -617,6 +614,12 @@ static void bond_ipsec_del_sa_all(struct bonding
> > *bond)
> >  
> >  	mutex_lock(&bond->ipsec_lock);
> >  	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> > +		/* No need to handle DEAD XFRM, as it has already
> > been
> > +		 * deleted and will be freed later.
> > +		 */
> > +		if (ipsec->xs->km.state == XFRM_STATE_DEAD)
> > +			continue;
> > +
> 
> If this doesn't free dead entries now and bond_ipsec_add_sa_all is
> called soon after, the pending bond_ipsec_free_sa() call will then hit
> the WARN_ON(xs->xso.real_dev != real_dev) before attempting to call
> free on the wrong device.
> To fix that, these entries should be freed here and the WARN_ON in
> bond_ipsec_free_sa() should be converted to an if...goto out, so that 
> bond_ipsec_free_sa() calls would hit one of these conditions:
> 1. "if (!slave)", when no active device exists.
> 2. "if (!xs->xso.real_dev)", when xdo_dev_state_add() failed.
> 3. "if (xs->xso.real_dev != real_dev)", when a DEAD xs was already
> freed by bond_ipsec_del_sa_all() migration to a new device.
> In all 3 cases, xdo_dev_state_free() shouldn't be called, only xs
> removed from the bond->ipsec list.
> 
> I hope I didn't miss any corner case.

Thumb up! Thanks a lot for your review and comments. You thought much more
than me. During bonding testing, we also found a case that would trigger
the WARN_ON(xs->xso.real_dev != real_dev).

If we create active-backup mode bonding and create ipsec tunnel over
bonding device, then remove bonding device. There is a possibility that
the bond call bond_ipsec_del_sa_all() to delete the ipsec state first,
then change active slave to another interface.

At the same time, ipsec gc was called and then bond_ipsec_free_sa().
This will cause the xs->xso.real_dev != active_slave as the failover
triggered. The call traces looks like:

[14504.421247] bond0: (slave enp23s0f1np1): Enslaving as a backup interface with an up link
[14506.761933] mlx5_core 0000:17:00.0: lag map active ports: 1
[14506.767520] mlx5_core 0000:17:00.0: shared_fdb:0 mode:hash
[14550.992133] bond0: (slave enp23s0f0np0): Releasing backup interface
[14550.994150] mlx5_core 0000:17:00.0: lag map active ports: 1, 2
[14550.998407] bond0: (slave enp23s0f1np1): making interface the new active one
[14551.013286] ------------[ cut here ]------------
[14551.017912] WARNING: CPU: 7 PID: 1537 at drivers/net/bonding/bond_main.c:664 bond_ipsec_free_sa+0x9b/0xa0 [bonding]
[14551.117875] Unloaded tainted modules: bonding(E):33 fjes(E):1 padlock_aes(E):2 [last unloaded: bonding(E)]
[14551.148449] CPU: 7 UID: 0 PID: 1537 Comm: kworker/7:2 Kdump: loaded Tainted: G            E      6.13.0-rc7+ #5
[14551.158536] Tainted: [E]=UNSIGNED_MODULE
[14551.162461] Hardware name: Dell Inc. PowerEdge R750/0WT8Y6, BIOS 1.5.4 12/17/2021
[14551.169941] Workqueue: events xfrm_state_gc_task
[14551.174559] RIP: 0010:bond_ipsec_free_sa+0x9b/0xa0 [bonding]
[14551.180227] Code: 8b 85 38 05 00 00 65 ff 08 5b 5d c3 cc cc cc cc 5b 5d e9 e8 e3 01 da e8 e3 e3 01 da 48 83 bb b0 02 00 00 00 74 e3 0f 0b eb df <0f> 0b eb b4 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
[14551.198972] RSP: 0018:ff61163a49eb3e00 EFLAGS: 00010287
[14551.204200] RAX: ff42be3fe4bd8000 RBX: ff42be3fa7359d40 RCX: 00000000802a0025
[14551.211336] RDX: ff42be4edc534280 RSI: 00000000fffffe00 RDI: ff42be4edc534280
[14551.218476] RBP: ff42be3f50128000 R08: 0000000000000000 R09: 0000000000000001
[14551.225606] R10: 00000000802a0025 R11: ff42be404d917f60 R12: ff42be5e7edb4e80
[14551.232740] R13: ff42be4edc534280 R14: ffffffff9db3db40 R15: 0000000000000000
[14551.239872] FS:  0000000000000000(0000) GS:ff42be5e7ed80000(0000) knlGS:0000000000000000
[14551.247957] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14551.253704] CR2: 00007fff69f55df0 CR3: 0000001158a22002 CR4: 0000000000773ef0
[14551.260836] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[14551.267970] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[14551.275101] PKRU: 55555554
[14551.277814] Call Trace:
[14551.280268]  <TASK>
[14551.282374]  ? show_trace_log_lvl+0x1b0/0x2f0
[14551.286742]  ? show_trace_log_lvl+0x1b0/0x2f0
[14551.291102]  ? xfrm_dev_state_free+0x84/0xb0
[14551.295374]  ? bond_ipsec_free_sa+0x9b/0xa0 [bonding]
[14551.300435]  ? __warn.cold+0x93/0xf4
[14551.304020]  ? bond_ipsec_free_sa+0x9b/0xa0 [bonding]
[14551.309076]  ? report_bug+0xff/0x140
[14551.312662]  ? handle_bug+0x53/0x90
[14551.316157]  ? exc_invalid_op+0x17/0x70
[14551.319994]  ? asm_exc_invalid_op+0x1a/0x20
[14551.324183]  ? bond_ipsec_free_sa+0x9b/0xa0 [bonding]
[14551.329242]  xfrm_dev_state_free+0x84/0xb0
[14551.333343]  ___xfrm_state_destroy+0xe3/0x160
[14551.337701]  xfrm_state_gc_task+0x7a/0xb0
[14551.341713]  process_one_work+0x174/0x330
[14551.345729]  worker_thread+0x252/0x390
[14551.349487]  ? __pfx_worker_thread+0x10/0x10
[14551.353761]  kthread+0xcf/0x100
[14551.356908]  ? __pfx_kthread+0x10/0x10
[14551.360668]  ret_from_fork+0x31/0x50
[14551.364249]  ? __pfx_kthread+0x10/0x10
[14551.368009]  ret_from_fork_asm+0x1a/0x30
[14551.371943]  </TASK>
[14551.374136] ---[ end trace 0000000000000000 ]---
[14551.735092] bond0: (slave enp23s0f1np1): Releasing backup interface
[14552.110577] bond0 (unregistering): Released all slaves

This seems like another situation that could not simply fit
3. "if (xs->xso.real_dev != real_dev), goto out.
I'm not sure what's the xs->km.state should be during xfrm_state_gc_task().
Is it also set to XFRM_STATE_DEAD, because I didn't see it.

Especially if the bond change active slave and xfrm_state_gc_task() run
in parallel, like

  bond_ipsec_del_sa_all()
                                  xfrm_state_gc_task()
                                  xfrm_dev_state_free()
                                  bond_ipsec_free_sa()
  bond_ipsec_add_sa_all()

If the xs->km.state is not XFRM_STATE_DEAD. How to avoid the
WARN_ON(xs->xso.real_dev != real_dev) in bond_ipsec_free_sa()
and how to make bond_ipsec_add_sa_all() not added the entry again.

Thanks
Hangbin

  reply	other threads:[~2025-02-26 12:07 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-25  9:40 [PATCHv2 net 0/3] bond: fix xfrm offload issues Hangbin Liu
2025-02-25  9:40 ` [PATCHv2 net 1/3] bonding: move mutex lock to a work queue for XFRM GC tasks Hangbin Liu
2025-02-25 11:05   ` Nikolay Aleksandrov
2025-02-25 13:13     ` Hangbin Liu
2025-02-25 13:30       ` Nikolay Aleksandrov
2025-02-25 14:00   ` Cosmin Ratiu
2025-02-25 14:27     ` Nikolay Aleksandrov
2025-02-26  9:48     ` Hangbin Liu
2025-02-26 11:05       ` Cosmin Ratiu
2025-02-26 12:07         ` Hangbin Liu [this message]
2025-02-26 14:05           ` Cosmin Ratiu
2025-02-25  9:40 ` [PATCHv2 net 2/3] bonding: fix xfrm offload feature setup on active-backup mode Hangbin Liu
2025-02-25  9:40 ` [PATCHv2 net 3/3] selftests: bonding: add ipsec offload test Hangbin Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z78EA2LEuFAwufNJ@fedora \
    --to=liuhangbin@gmail.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=cratiu@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=jarod@redhat.com \
    --cc=jianbol@nvidia.com \
    --cc=jv@jvosburgh.net \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=razor@blackwall.org \
    --cc=shuah@kernel.org \
    --cc=steffen.klassert@secunet.com \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.