All of lore.kernel.org
 help / color / mirror / Atom feed
From: Leon Romanovsky <leon@kernel.org>
To: Shifeng Li <lishifeng@sangfor.com.cn>
Cc: jgg@ziepe.ca, wenglianfa@huawei.com, gustavoars@kernel.org,
	linux-rdma@vger.kernel.org, linux-kernel@vger.kernel.org,
	Shifeng Li <lishifeng1992@126.com>
Subject: Re: [PATCH] RDMA/device: Fix a race between mad_client and cm_client init
Date: Tue, 2 Jan 2024 13:58:06 +0200	[thread overview]
Message-ID: <20240102115806.GH6361@unreal> (raw)
In-Reply-To: <61e69337-c32e-4df7-a31c-faf112f36466@sangfor.com.cn>

On Tue, Jan 02, 2024 at 07:33:41PM +0800, Shifeng Li wrote:
> On 2024/1/2 16:58, Leon Romanovsky wrote:
> > On Mon, Jan 01, 2024 at 07:43:35PM -0800, Shifeng Li wrote:
> > > The mad_client will be initialized in enable_device_and_get(), while the
> > > devices_rwsem will be downgraded to a read semaphore. There is a window
> > > that leads to the failed initialization for cm_client, since it can not
> > > get matched mad port from ib_mad_port_list, and the matched mad port will
> > > be added to the list after that.
> > > 
> > >      mad_client    |                       cm_client
> > > ------------------|--------------------------------------------------------
> > > ib_register_device|
> > > enable_device_and_get
> > > down_write(&devices_rwsem)
> > > xa_set_mark(&devices, DEVICE_REGISTERED)
> > > downgrade_write(&devices_rwsem)
> > >                    |
> > >                    |ib_cm_init
> > >                    |ib_register_client(&cm_client)
> > >                    |down_read(&devices_rwsem)
> > >                    |xa_for_each_marked (&devices, DEVICE_REGISTERED)
> > >                    |add_client_context
> > >                    |cm_add_one
> > >                    |ib_register_mad_agent
> > >                    |ib_get_mad_port
> > >                    |__ib_get_mad_port
> > >                    |list_for_each_entry(entry, &ib_mad_port_list, port_list)
> > >                    |return NULL
> > >                    |up_read(&devices_rwsem)
> > >                    |
> > > add_client_context|
> > > ib_mad_init_device|
> > > ib_mad_port_open  |
> > > list_add_tail(&port_priv->port_list, &ib_mad_port_list)
> > > up_read(&devices_rwsem)
> > >                    |
> > 
> > How is this stack possible?
> > 
> > ib_register_device() is called by drivers and happens much later than ib_cm_init().
> > 
> 
> I've caught the stack and err log as follows, ib_mad_init_device() called after


ib_mad_init_device != ib_register_device

You mentioned ib_register_device() in your callstack, while you caught ib_mad_init_device().

Thanks

> cm_add_one().
> 
> [   98.281786] CPU: 18 PID: 30079 Comm: modprobe Kdump: loaded
> [   98.281787] Call Trace:
> [   98.281790]  dump_stack+0x71/0xab
> [   98.281805]  ib_register_mad_agent+0x1c6/0x27f0 [ib_core]
> [   98.281840]  cm_add_one+0x4b0/0x9d0 [ib_cm]
> [   98.281865]  add_client_context+0x2b9/0x380 [ib_core]
> [   98.281890]  ib_register_client+0x22a/0x2a0 [ib_core]
> [   98.281908]  __init_backport+0x12f/0x1000 [ib_cm]
> [   98.281912]  do_one_initcall+0x87/0x2e2
> [   98.281926]  do_init_module+0x1c3/0x5f7
> [   98.281928]  load_module+0x4fe0/0x68d0
> [   98.281945]  __do_sys_finit_module+0x14d/0x180
> [   98.281952]  do_syscall_64+0xa0/0x370
> [   98.281957]  entry_SYSCALL_64_after_hwframe+0x65/0xca
> [   98.281958] RIP: 0033:0x7f51d3a4373d
> [   98.281961] RSP: 002b:00007ffceb925938 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
> [   98.281964] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f51d3a4373d
> [   98.281965] RDX: 0000000000000000 RSI: 0000000001636b70 RDI: 0000000000000003
> [   98.281966] RBP: 00007ffceb925950 R08: 0000000000000000 R09: 0000000001636b70
> [   98.281967] R10: 0000000000000003 R11: 0000000000000202 R12: 0000000000402410
> [   98.281968] R13: 00007ffceb926e60 R14: 0000000000000000 R15: 0000000000000000
> 
> [   98.281977] infiniband mlx5_1: ib_register_mad_agent: Invalid port 1
> 
> [   98.349040] CPU: 38 PID: 29896 Comm: modprobe Kdump: loaded
> [   98.349043] Call Trace:
> [   98.349053]  dump_stack+0x71/0xab
> [   98.349076]  ib_register_mad_agent+0x1c6/0x27f0 [ib_core]
> [   98.349149]  ib_agent_port_open+0xe2/0x2d0 [ib_core]
> [   98.349164]  ib_mad_init_device+0x818/0x1d70 [ib_core]
> [   98.349197]  add_client_context+0x2b9/0x380 [ib_core]
> [   98.349221]  enable_device_and_get+0x1ab/0x340 [ib_core]
> [   98.349292]  ib_register_device+0xcbf/0xfd0 [ib_core]
> [   98.349355]  __mlx5_ib_add+0x44/0xf0 [mlx5_ib]
> [   98.349404]  mlx5_add_device+0xc3/0x280 [mlx5_core]
> [   98.349434]  mlx5_register_interface+0x109/0x190 [mlx5_core]
> [   98.349442]  do_one_initcall+0x87/0x2e2
> [   98.349460]  do_init_module+0x1c3/0x5f7
> [   98.349462]  load_module+0x4fe0/0x68d0
> [   98.349481]  __do_sys_init_module+0x1f9/0x220
> [   98.349486]  do_syscall_64+0xa0/0x370
> [   98.349492]  entry_SYSCALL_64_after_hwframe+0x65/0xca
> [   98.349494] RIP: 0033:0x7fbc327faa7a
> [   98.349500] RSP: 002b:00007fff890df7f8 EFLAGS: 00000202 ORIG_RAX: 00000000000000af
> [   98.349502] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fbc327faa7a
> [   98.349503] RDX: 00000000004250c0 RSI: 00000000001b0230 RDI: 00007fbc31919010
> [   98.349505] RBP: 00007fff890df860 R08: 00000000025045b0 R09: 0000000000000000
> [   98.349506] R10: 00000000001b0230 R11: 0000000000000202 R12: 0000000000402410
> [   98.349507] R13: 00007fff890e0cf0 R14: 0000000000000000 R15: 0000000000000000
> 
> Thanks
> 
> > Thanks
> > 
> > > 
> > > Fix it by using the devices_rwsem write semaphore to protect the mad_client
> > > init flow in enable_device_and_get().
> > > 
> > > Fixes: d0899892edd0 ("RDMA/device: Provide APIs from the core code to help unregistration")
> > > Cc: Shifeng Li <lishifeng1992@126.com>
> > > Signed-off-by: Shifeng Li <lishifeng@sangfor.com.cn>
> > > ---
> > >   drivers/infiniband/core/device.c | 8 +-------
> > >   1 file changed, 1 insertion(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> > > index 67bcea7a153c..85782786993d 100644
> > > --- a/drivers/infiniband/core/device.c
> > > +++ b/drivers/infiniband/core/device.c
> > > @@ -1315,12 +1315,6 @@ static int enable_device_and_get(struct ib_device *device)
> > >   	down_write(&devices_rwsem);
> > >   	xa_set_mark(&devices, device->index, DEVICE_REGISTERED);
> > > -	/*
> > > -	 * By using downgrade_write() we ensure that no other thread can clear
> > > -	 * DEVICE_REGISTERED while we are completing the client setup.
> > > -	 */
> > > -	downgrade_write(&devices_rwsem);
> > > -
> > >   	if (device->ops.enable_driver) {
> > >   		ret = device->ops.enable_driver(device);
> > >   		if (ret)
> > > @@ -1337,7 +1331,7 @@ static int enable_device_and_get(struct ib_device *device)
> > >   	if (!ret)
> > >   		ret = add_compat_devs(device);
> > >   out:
> > > -	up_read(&devices_rwsem);
> > > +	up_write(&devices_rwsem);
> > >   	return ret;
> > >   }
> > > -- 
> > > 2.25.1
> > > 
> > > 
> > 
> 
> 

  reply	other threads:[~2024-01-02 11:58 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-02  3:43 [PATCH] RDMA/device: Fix a race between mad_client and cm_client init Shifeng Li
2024-01-02  8:58 ` Leon Romanovsky
2024-01-02 11:33   ` Shifeng Li
2024-01-02 11:58     ` Leon Romanovsky [this message]
2024-01-02 12:30       ` Shifeng Li
2024-01-03 18:48 ` Jason Gunthorpe
     [not found]   ` <80cac9fd-7fed-403e-8889-78e2fc7a49b0@sangfor.com.cn>
2024-01-04 12:37     ` Jason Gunthorpe
2024-01-05  8:15       ` Shifeng Li
2024-01-05 14:19         ` Jason Gunthorpe
2024-01-15  3:27           ` Shifeng Li
2024-01-06  2:12       ` Ding Hui
2024-01-15 13:47         ` Jason Gunthorpe
     [not found]           ` <354e2bf7-a8b4-629d-3d2d-35951a52e8bd@sangfor.com.cn>
2024-01-26  2:25             ` Shifeng Li
2024-02-01  0:04               ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240102115806.GH6361@unreal \
    --to=leon@kernel.org \
    --cc=gustavoars@kernel.org \
    --cc=jgg@ziepe.ca \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=lishifeng1992@126.com \
    --cc=lishifeng@sangfor.com.cn \
    --cc=wenglianfa@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.