Linux RDMA and InfiniBand development
* Re: Segfault in mlx5 driver on infiniband after application fork
@ 2024-02-11 19:24 Kevan Rehm
  2024-02-12 13:33 ` Jason Gunthorpe
  0 siblings, 1 reply; 14+ messages in thread
From: Kevan Rehm @ 2024-02-11 19:24 UTC (permalink / raw)
  To: Mark Zhang, Leon Romanovsky
  Cc: linux-rdma@vger.kernel.org, Yishai Hadas, kevan.rehm


>> An application started by pytorch does a fork, then the child process attempts to use libfabric to open a new DAOS infiniband endpoint.    The original endpoint is owned and still in use by the parent process.
>>
>> When the parent process created the endpoint (fi_fabric, fi_domain, fi_endpoint calls), the mlx5 driver allocated memory pages for use in SRQ creation, and issued a madvise to say that the pages are DONTFORK.  These pages are associated with the domain’s ibv_device which is cached in the driver.   After the fork when the child process calls fi_domain for its new endpoint, it gets the ibv_device that was cached at the time it was created by the parent.   The child process immediately segfaults when trying to create a SRQ, because the pages associated with that ibv_device are not in the child’s memory.  There doesn’t appear to be any way for a child process to create a fresh endpoint because of the caching being done for ibv_devices.
>>

> For anyone who is interested in this issue, please follow the links below:
> https://github.com/ofiwg/libfabric/issues/9792
> https://daosio.atlassian.net/browse/DAOS-15117
> 
> Regarding the issue, I don't know if mlx5 is actively used to run
> libfabric, but the mentioned call to ibv_dontfork_range() has existed
> since the prehistoric era.

Yes, libfabric has used mlx5 for a long time.

> Do you have any environment variables set related to rdma-core?
> 
IBV_FORK_SAFE is set to 1

> Is it related to ibv_fork_init()? It must be called when fork() is called.

Calling ibv_fork_init() doesn’t help, because it immediately checks mm_root, sees it is non-zero (from the parent process’s prior call), and returns doing nothing.
There is now a simplified test case, see https://github.com/ofiwg/libfabric/issues/9792 for ongoing analysis.
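
The early return described above can be illustrated without any RDMA hardware. The following is only a sketch of the general mechanism, not rdma-core code: demo_root, demo_init_once() and demo_init_result_in_child() are hypothetical names, with demo_root standing in for mm_root. Because fork() copies the parent's globals, an init-once guard that was satisfied in the parent is still satisfied in the child, so the child's call does no setup.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the library's mm_root pointer. */
static void *demo_root;

/* Returns 1 if this call did the setup, 0 if the guard skipped it, -1 on error. */
static int demo_init_once(void)
{
    if (demo_root)
        return 0;               /* already set, possibly by the parent */
    demo_root = malloc(1);
    return demo_root ? 1 : -1;
}

/*
 * Initializes in the parent, forks, and calls demo_init_once() again in the
 * child. Returns 0 if the child's call was skipped by the guard, 1 if the
 * child redid the setup, -1 on failure.
 */
static int demo_init_result_in_child(void)
{
    if (demo_init_once() < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(demo_init_once() == 0 ? 0 : 1);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling demo_init_result_in_child() from a test program shows whether the child's call was skipped; any per-process state tied to the guard is therefore never rebuilt for the child.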

* Re: Segfault in mlx5 driver on infiniband after application fork
@ 2024-02-21 12:51 Kevan Rehm
  0 siblings, 0 replies; 14+ messages in thread
From: Kevan Rehm @ 2024-02-21 12:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma@vger.kernel.org,
	Yishai Hadas, Kevan Rehm, chien.tin.tung@intel.com, Kevan Rehm

I posted PR #1431 for this.   I tested with IBV_FORK_SAFE and RDMA_FORK_SAFE set and unset.  Also with UCX_IB_FORK_INIT unset, set to no, and set to yes.   All combos work correctly without segfault.

* Re: Segfault in mlx5 driver on infiniband after application fork
@ 2024-02-13 16:45 Kevan Rehm
  0 siblings, 0 replies; 14+ messages in thread
From: Kevan Rehm @ 2024-02-13 16:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mark Zhang, Leon Romanovsky, linux-rdma, Yishai Hadas,
	chien.tin.tung, kevan.rehm

> Newer kernels are detected and disable the DONT_FORK calls in verbs.
> 
> rdma-core support is present since:
> 
> commit 67b00c3835a3480a035a9e1bcf5695f5c0e8568e
> Author: Gal Pressman <galpress@amazon.com>
> Date:   Sun Apr 4 17:24:54 2021 +0300
> 
>    verbs: Report when ibv_fork_init() is not needed
> 
>    Identify kernels which do not require ibv_fork_init() to be called and
>    report it through the ibv_is_fork_initialized() verb.
> 
>    The feature detection is done through a new read-only attribute in the
>    get sys netlink command. If the attribute is not reported, assume old
>    kernel without COF support. If the attribute is reported, use the
>    returned value.
> 
>    This allows ibv_is_fork_initialized() to return the previously unused
>    IBV_FORK_UNNEEDED value, which takes precedence over the
>    DISABLED/ENABLED values. Meaning that if the kernel does not require a
>    call to ibv_fork_init(), IBV_FORK_UNNEEDED will be returned regardless
>    of whether ibv_fork_init() was called or not.
> 
>    Signed-off-by: Gal Pressman <galpress@amazon.com>
> 
> The kernel support was in v5.13-rc1~78^2~1
> 
> And backported in a few cases.
> 
> Jason

The above info was immensely helpful, and I am running MOFED 23.10-OFED.23.10.0.5.5.1 so my kernel already has the fork improvements.  However, there are still issues, as the above requires all callers to check ibv_is_fork_initialized() before every call to ibv_fork_init().  Not everyone does this.

Routine ibv_get_device() unconditionally calls ibverbs_init() on the first call, and that routine calls ibv_fork_init() if either RDMA_FORK_SAFE or IBV_FORK_SAFE is set, even if the kernel has the fork enhancements.  I wrapped that check with a call to ibv_is_fork_initialized(), and skipped the ibv_fork_init() call if IBV_FORK_UNNEEDED was returned.  This caused my little test program to run successfully, but the original benchmark still bombed.

The benchmark uses MPI.  It turns out that mpi4py calls PMPI_Init() which eventually makes UCX calls, and routine uct_ib_md_open() in UCX calls ibv_fork_init() without first calling ibv_is_fork_initialized().  It’s looking at some md_config->fork_init variable, not checking the kernel support.    In order to cover all potential cases, I changed my rdma patch to instead call ibv_is_fork_initialized() inside ibv_fork_init() itself, and return 0 without creating mm_root if kernel support is there.   This causes MPI and the original benchmark to work.

Is this a reasonable fix that could be added to rdma?

[root@delphi-029 libibverbs]# diff -C 5 memory.c.orig memory.c
*** memory.c.orig 2024-02-13 09:45:28.078997178 -0600
--- memory.c 2024-02-13 09:27:46.901699958 -0600
***************
*** 140,149 ****
--- 140,152 ----
          huge_page_enabled = 1;

      if (mm_root)
          return 0;

+     if (ibv_is_fork_initialized() == IBV_FORK_UNNEEDED)
+         return 0;
+
      if (too_late)
          return EINVAL;

      fprintf(stderr, "ibv_fork_init creating mm_root\n");
      page_size = sysconf(_SC_PAGESIZE);
* Segfault in mlx5 driver on infiniband after application fork
@ 2024-02-07 19:17 Rehm, Kevan
  2024-02-08  8:52 ` Leon Romanovsky
  0 siblings, 1 reply; 14+ messages in thread
From: Rehm, Kevan @ 2024-02-07 19:17 UTC (permalink / raw)
  To: linux-rdma@vger.kernel.org

Greetings,
 
I don’t see a way to open a ticket at rdma-core; it was suggested that I send this email instead.
 
I have been chasing a problem in rdma-core-47.1.   Originally, I opened a ticket in libfabric, but it was pointed out that mlx5 is not part of libfabric.   Full description of the problem plus debug notes are documented at the github repository for libfabric, see issue 9792, please have a look there rather than repeating all of the background information in this email.
 
An application started by pytorch does a fork, then the child process attempts to use libfabric to open a new DAOS infiniband endpoint.    The original endpoint is owned and still in use by the parent process. 
 
When the parent process created the endpoint (fi_fabric, fi_domain, fi_endpoint calls), the mlx5 driver allocated memory pages for use in SRQ creation, and issued a madvise to say that the pages are DONTFORK.  These pages are associated with the domain’s ibv_device which is cached in the driver.   After the fork when the child process calls fi_domain for its new endpoint, it gets the ibv_device that was cached at the time it was created by the parent.   The child process immediately segfaults when trying to create a SRQ, because the pages associated with that ibv_device are not in the child’s memory.  There doesn’t appear to be any way for a child process to create a fresh endpoint because of the caching being done for ibv_devices.
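
The failure mode described above can be reproduced without libfabric or mlx5. The following Linux-only sketch uses only mmap(), madvise() and fork(); the function name child_segfaults_on_dontfork_page() and its return convention are illustrative, and the buffer is just a stand-in for the pages a driver would allocate. The child inherits the pointer value but not the mapping, so its first access faults.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a forked child dies with SIGSEGV when it touches a page the
 * parent marked MADV_DONTFORK, 0 if the child survives the access, and -1
 * if the setup itself fails.
 */
static int child_segfaults_on_dontfork_page(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return -1;
    size_t len = (size_t)page_size;

    /* Stand-in for a buffer a driver would allocate for a queue. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset(buf, 0xab, len);

    /* After this call the range is not mapped into newly forked children. */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: the pointer value was inherited, the mapping was not. */
        char c = *(volatile char *)buf;
        (void)c;
        _exit(0);   /* reached only if the page was still mapped */
    }

    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid)
        result = (WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV) ? 1 : 0;
    munmap(buf, len);
    return result;
}
```

The parent can keep using the buffer normally; only a process created after the madvise() call is affected, which matches the parent/child split in the report.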
 
Is this the proper way to “open a ticket” against rdma-core?
 
Regards, Kevan





end of thread, other threads:[~2024-02-21 12:51 UTC | newest]

Thread overview: 14+ messages
2024-02-11 19:24 Segfault in mlx5 driver on infiniband after application fork Kevan Rehm
2024-02-12 13:33 ` Jason Gunthorpe
2024-02-12 14:37   ` Kevan Rehm
2024-02-12 14:40     ` Jason Gunthorpe
2024-02-12 16:04       ` Kevan Rehm
2024-02-12 16:12         ` Jason Gunthorpe
2024-02-12 16:37           ` Kevan Rehm
2024-02-12 16:45             ` Jason Gunthorpe
2024-02-16 19:56               ` Kevan Rehm
  -- strict thread matches above, loose matches on Subject: below --
2024-02-21 12:51 Kevan Rehm
2024-02-13 16:45 Kevan Rehm
2024-02-07 19:17 Rehm, Kevan
2024-02-08  8:52 ` Leon Romanovsky
2024-02-08  9:05   ` Mark Zhang
