netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hangbin Liu <liuhangbin@gmail.com>
To: Cosmin Ratiu <cratiu@nvidia.com>
Cc: "shuah@kernel.org" <shuah@kernel.org>,
	"andrew+netdev@lunn.ch" <andrew+netdev@lunn.ch>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"jv@jvosburgh.net" <jv@jvosburgh.net>,
	"jarod@redhat.com" <jarod@redhat.com>,
	"razor@blackwall.org" <razor@blackwall.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"pabeni@redhat.com" <pabeni@redhat.com>,
	Jianbo Liu <jianbol@nvidia.com>,
	"edumazet@google.com" <edumazet@google.com>,
	"horms@kernel.org" <horms@kernel.org>,
	"kuba@kernel.org" <kuba@kernel.org>,
	Tariq Toukan <tariqt@nvidia.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"steffen.klassert@secunet.com" <steffen.klassert@secunet.com>,
	"linux-kselftest@vger.kernel.org"
	<linux-kselftest@vger.kernel.org>
Subject: Re: [PATCHv2 net 1/3] bonding: move mutex lock to a work queue for XFRM GC tasks
Date: Wed, 26 Feb 2025 12:07:31 +0000	[thread overview]
Message-ID: <Z78EA2LEuFAwufNJ@fedora> (raw)
In-Reply-To: <0605dc53cdcee5ea71b89114f2318dd5d0a83276.camel@nvidia.com>

On Wed, Feb 26, 2025 at 11:05:47AM +0000, Cosmin Ratiu wrote:
> > > What do you think about this idea?
> > 
> > Thanks a lot for the comments. I also skipped the DEAD xs in
> > add_sa_all.
> > What about the patch like:
> 
> This is what I had in mind, thanks for proposing it. Maybe you should
> package it in a new submission with a proper title/etc.?
> I'll do the initial review here.

This is a draft patch and I think there may have something need to be fixed.
So I just paste it here :)

> 
> > 
> > diff --git a/drivers/net/bonding/bond_main.c
> > b/drivers/net/bonding/bond_main.c
> > index e45bba240cbc..0e4db43a833a 100644
> > --- a/drivers/net/bonding/bond_main.c
> > +++ b/drivers/net/bonding/bond_main.c
> > @@ -537,6 +537,12 @@ static void bond_ipsec_add_sa_all(struct bonding
> > *bond)
> >  	}
> >  
> >  	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> > +		/* No need to handle DEAD XFRM, as it has already
> > been
> > +		 * deleted and will be freed later.
> > +		 */
> 
> Nit: Maybe rephrase that as "Skip dead xfrm states, they'll be freed
> later."
> 
> > +		if (ipsec->xs->km.state == XFRM_STATE_DEAD)
> > +			continue;
> > +
> >  		/* If new state is added before ipsec_lock acquired
> > */
> >  		if (ipsec->xs->xso.real_dev == real_dev)
> >  			continue;
> > @@ -592,15 +598,6 @@ static void bond_ipsec_del_sa(struct xfrm_state
> > *xs)
> >  	real_dev->xfrmdev_ops->xdo_dev_state_delete(xs);
> >  out:
> >  	netdev_put(real_dev, &tracker);
> > -	mutex_lock(&bond->ipsec_lock);
> > -	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> > -		if (ipsec->xs == xs) {
> > -			list_del(&ipsec->list);
> > -			kfree(ipsec);
> > -			break;
> > -		}
> > -	}
> > -	mutex_unlock(&bond->ipsec_lock);
> >  }
> >  
> >  static void bond_ipsec_del_sa_all(struct bonding *bond)
> > @@ -617,6 +614,12 @@ static void bond_ipsec_del_sa_all(struct bonding
> > *bond)
> >  
> >  	mutex_lock(&bond->ipsec_lock);
> >  	list_for_each_entry(ipsec, &bond->ipsec_list, list) {
> > +		/* No need to handle DEAD XFRM, as it has already
> > been
> > +		 * deleted and will be freed later.
> > +		 */
> > +		if (ipsec->xs->km.state == XFRM_STATE_DEAD)
> > +			continue;
> > +
> 
> If this doesn't free dead entries now and bond_ipsec_add_sa_all is
> called soon after, the pending bond_ipsec_free_sa() call will then hit
> the WARN_ON(xs->xso.real_dev != real_dev) before attempting to call
> free on the wrong device.
> To fix that, these entries should be freed here and the WARN_ON in
> bond_ipsec_free_sa() should be converted to an if...goto out, so that 
> bond_ipsec_free_sa() calls would hit one of these conditions:
> 1. "if (!slave)", when no active device exists.
> 2. "if (!xs->xso.real_dev)", when xdo_dev_state_add() failed.
> 3. "if (xs->xso.real_dev != real_dev)", when a DEAD xs was already
> freed by bond_ipsec_del_sa_all() migration to a new device.
> In all 3 cases, xdo_dev_state_free() shouldn't be called, only xs
> removed from the bond->ipsec list.
> 
> I hope I didn't miss any corner case.

Thumb up! Thanks a lot for your review and comments. You thought much more
than me. During bonding testing, we also found a case that would trigger
the WARN_ON(xs->xso.real_dev != real_dev).

If we create active-backup mode bonding and create ipsec tunnel over
bonding device, then remove bonding device. There is a possibility that
the bond call bond_ipsec_del_sa_all() to delete the ipsec state first,
then change active slave to another interface.

At the same time, ipsec gc was called and then bond_ipsec_free_sa().
This will cause the xs->xso.real_dev != active_slave as the failover
triggered. The call traces looks like:

[14504.421247] bond0: (slave enp23s0f1np1): Enslaving as a backup interface with an up link
[14506.761933] mlx5_core 0000:17:00.0: lag map active ports: 1
[14506.767520] mlx5_core 0000:17:00.0: shared_fdb:0 mode:hash
[14550.992133] bond0: (slave enp23s0f0np0): Releasing backup interface
[14550.994150] mlx5_core 0000:17:00.0: lag map active ports: 1, 2
[14550.998407] bond0: (slave enp23s0f1np1): making interface the new active one
[14551.013286] ------------[ cut here ]------------
[14551.017912] WARNING: CPU: 7 PID: 1537 at drivers/net/bonding/bond_main.c:664 bond_ipsec_free_sa+0x9b/0xa0 [bonding]
[14551.117875] Unloaded tainted modules: bonding(E):33 fjes(E):1 padlock_aes(E):2 [last unloaded: bonding(E)]
[14551.148449] CPU: 7 UID: 0 PID: 1537 Comm: kworker/7:2 Kdump: loaded Tainted: G            E      6.13.0-rc7+ #5
[14551.158536] Tainted: [E]=UNSIGNED_MODULE
[14551.162461] Hardware name: Dell Inc. PowerEdge R750/0WT8Y6, BIOS 1.5.4 12/17/2021
[14551.169941] Workqueue: events xfrm_state_gc_task
[14551.174559] RIP: 0010:bond_ipsec_free_sa+0x9b/0xa0 [bonding]
[14551.180227] Code: 8b 85 38 05 00 00 65 ff 08 5b 5d c3 cc cc cc cc 5b 5d e9 e8 e3 01 da e8 e3 e3 01 da 48 83 bb b0 02 00 00 00 74 e3 0f 0b eb df <0f> 0b eb b4 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3
[14551.198972] RSP: 0018:ff61163a49eb3e00 EFLAGS: 00010287
[14551.204200] RAX: ff42be3fe4bd8000 RBX: ff42be3fa7359d40 RCX: 00000000802a0025
[14551.211336] RDX: ff42be4edc534280 RSI: 00000000fffffe00 RDI: ff42be4edc534280
[14551.218476] RBP: ff42be3f50128000 R08: 0000000000000000 R09: 0000000000000001
[14551.225606] R10: 00000000802a0025 R11: ff42be404d917f60 R12: ff42be5e7edb4e80
[14551.232740] R13: ff42be4edc534280 R14: ffffffff9db3db40 R15: 0000000000000000
[14551.239872] FS:  0000000000000000(0000) GS:ff42be5e7ed80000(0000) knlGS:0000000000000000
[14551.247957] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14551.253704] CR2: 00007fff69f55df0 CR3: 0000001158a22002 CR4: 0000000000773ef0
[14551.260836] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[14551.267970] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[14551.275101] PKRU: 55555554
[14551.277814] Call Trace:
[14551.280268]  <TASK>
[14551.282374]  ? show_trace_log_lvl+0x1b0/0x2f0
[14551.286742]  ? show_trace_log_lvl+0x1b0/0x2f0
[14551.291102]  ? xfrm_dev_state_free+0x84/0xb0
[14551.295374]  ? bond_ipsec_free_sa+0x9b/0xa0 [bonding]
[14551.300435]  ? __warn.cold+0x93/0xf4
[14551.304020]  ? bond_ipsec_free_sa+0x9b/0xa0 [bonding]
[14551.309076]  ? report_bug+0xff/0x140
[14551.312662]  ? handle_bug+0x53/0x90
[14551.316157]  ? exc_invalid_op+0x17/0x70
[14551.319994]  ? asm_exc_invalid_op+0x1a/0x20
[14551.324183]  ? bond_ipsec_free_sa+0x9b/0xa0 [bonding]
[14551.329242]  xfrm_dev_state_free+0x84/0xb0
[14551.333343]  ___xfrm_state_destroy+0xe3/0x160
[14551.337701]  xfrm_state_gc_task+0x7a/0xb0
[14551.341713]  process_one_work+0x174/0x330
[14551.345729]  worker_thread+0x252/0x390
[14551.349487]  ? __pfx_worker_thread+0x10/0x10
[14551.353761]  kthread+0xcf/0x100
[14551.356908]  ? __pfx_kthread+0x10/0x10
[14551.360668]  ret_from_fork+0x31/0x50
[14551.364249]  ? __pfx_kthread+0x10/0x10
[14551.368009]  ret_from_fork_asm+0x1a/0x30
[14551.371943]  </TASK>
[14551.374136] ---[ end trace 0000000000000000 ]---
[14551.735092] bond0: (slave enp23s0f1np1): Releasing backup interface
[14552.110577] bond0 (unregistering): Released all slaves

This seems like another situation that could not simply fit
3. "if (xs->xso.real_dev != real_dev), goto out.
I'm not sure what's the xs->km.state should be during xfrm_state_gc_task().
Is it also set to XFRM_STATE_DEAD, because I didn't see it.

Especially if the bond change active slave and xfrm_state_gc_task() run
in parallel, like

  bond_ipsec_del_sa_all()
                                  xfrm_state_gc_task()
                                  xfrm_dev_state_free()
                                  bond_ipsec_free_sa()
  bond_ipsec_add_sa_all()

If the xs->km.state is not XFRM_STATE_DEAD. How to avoid the
WARN_ON(xs->xso.real_dev != real_dev) in bond_ipsec_free_sa()
and how to make bond_ipsec_add_sa_all() not added the entry again.

Thanks
Hangbin

  reply	other threads:[~2025-02-26 12:07 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-25  9:40 [PATCHv2 net 0/3] bond: fix xfrm offload issues Hangbin Liu
2025-02-25  9:40 ` [PATCHv2 net 1/3] bonding: move mutex lock to a work queue for XFRM GC tasks Hangbin Liu
2025-02-25 11:05   ` Nikolay Aleksandrov
2025-02-25 13:13     ` Hangbin Liu
2025-02-25 13:30       ` Nikolay Aleksandrov
2025-02-25 14:00   ` Cosmin Ratiu
2025-02-25 14:27     ` Nikolay Aleksandrov
2025-02-26  9:48     ` Hangbin Liu
2025-02-26 11:05       ` Cosmin Ratiu
2025-02-26 12:07         ` Hangbin Liu [this message]
2025-02-26 14:05           ` Cosmin Ratiu
2025-02-25  9:40 ` [PATCHv2 net 2/3] bonding: fix xfrm offload feature setup on active-backup mode Hangbin Liu
2025-02-25  9:40 ` [PATCHv2 net 3/3] selftests: bonding: add ipsec offload test Hangbin Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z78EA2LEuFAwufNJ@fedora \
    --to=liuhangbin@gmail.com \
    --cc=andrew+netdev@lunn.ch \
    --cc=cratiu@nvidia.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=jarod@redhat.com \
    --cc=jianbol@nvidia.com \
    --cc=jv@jvosburgh.net \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=razor@blackwall.org \
    --cc=shuah@kernel.org \
    --cc=steffen.klassert@secunet.com \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).