* [PATCH] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup
@ 2025-06-10 9:28 Arnd Bergmann
2025-06-10 9:50 ` Patrisious Haddad
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Arnd Bergmann @ 2025-06-10 9:28 UTC (permalink / raw)
To: Leon Romanovsky, Jason Gunthorpe, Patrisious Haddad
Cc: Arnd Bergmann, Christian Göttsche, Serge Hallyn,
Chiara Meiohas, Al Viro, linux-rdma, linux-kernel
From: Arnd Bergmann <arnd@arndb.de>
This function has an array of eight mlx5_async_cmd structures, which
often fits on the stack, but depending on the configuration can
end up blowing the stack frame warning limit:
drivers/infiniband/hw/mlx5/devx.c:2670:6: error: stack frame size (1392) exceeds limit (1280) in 'mlx5_ib_ufile_hw_cleanup' [-Werror,-Wframe-larger-than]
Change this to a dynamic allocation instead. While a kmalloc()
can theoretically fail, a GFP_KERNEL allocation under a page will
block until memory has been freed up, so in the worst case, this
only adds extra time in an already constrained environment.
Fixes: 7c891a4dbcc1 ("RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
drivers/infiniband/hw/mlx5/devx.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mlx5/devx.c b/drivers/infiniband/hw/mlx5/devx.c
index 2479da8620ca..c3c0ea219ab7 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -2669,7 +2669,7 @@ static void devx_wait_async_destroy(struct mlx5_async_cmd *cmd)
void mlx5_ib_ufile_hw_cleanup(struct ib_uverbs_file *ufile)
{
- struct mlx5_async_cmd async_cmd[MAX_ASYNC_CMDS];
+ struct mlx5_async_cmd *async_cmd;
struct ib_ucontext *ucontext = ufile->ucontext;
struct ib_device *device = ucontext->device;
struct mlx5_ib_dev *dev = to_mdev(device);
@@ -2678,6 +2678,10 @@ void mlx5_ib_ufile_hw_cleanup(struct ib_uverbs_file *ufile)
int head = 0;
int tail = 0;
+ async_cmd = kcalloc(MAX_ASYNC_CMDS, sizeof(*async_cmd), GFP_KERNEL);
+ if (WARN_ON(!async_cmd))
+ return;
+
list_for_each_entry(uobject, &ufile->uobjects, list) {
WARN_ON(uverbs_try_lock_object(uobject, UVERBS_LOOKUP_WRITE));
@@ -2713,6 +2717,8 @@ void mlx5_ib_ufile_hw_cleanup(struct ib_uverbs_file *ufile)
devx_wait_async_destroy(&async_cmd[head % MAX_ASYNC_CMDS]);
head++;
}
+
+ kfree(async_cmd);
}
static ssize_t devx_async_cmd_event_read(struct file *filp, char __user *buf,
--
2.39.5
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup
2025-06-10 9:28 [PATCH] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup Arnd Bergmann
@ 2025-06-10 9:50 ` Patrisious Haddad
2025-06-10 10:31 ` Arnd Bergmann
2025-06-12 9:05 ` Leon Romanovsky
2025-06-12 9:05 ` Leon Romanovsky
2 siblings, 1 reply; 6+ messages in thread
From: Patrisious Haddad @ 2025-06-10 9:50 UTC (permalink / raw)
To: Arnd Bergmann, Leon Romanovsky, Jason Gunthorpe
Cc: Arnd Bergmann, Christian Göttsche, Serge Hallyn,
Chiara Meiohas, Al Viro, linux-rdma, linux-kernel
On 6/10/2025 12:28 PM, Arnd Bergmann wrote:
> External email: Use caution opening links or attachments
>
>
> From: Arnd Bergmann <arnd@arndb.de>
>
> This function has an array of eight mlx5_async_cmd structures, which
> often fits on the stack, but depending on the configuration can
> end up blowing the stack frame warning limit:
>
> drivers/infiniband/hw/mlx5/devx.c:2670:6: error: stack frame size (1392) exceeds limit (1280) in 'mlx5_ib_ufile_hw_cleanup' [-Werror,-Wframe-larger-than]
>
> Change this to a dynamic allocation instead. While a kmalloc()
> can theoretically fail, a GFP_KERNEL allocation under a page will
> block until memory has been freed up, so in the worst case, this
> only adds extra time in an already constrained environment.
>
> Fixes: 7c891a4dbcc1 ("RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> drivers/infiniband/hw/mlx5/devx.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/devx.c b/drivers/infiniband/hw/mlx5/devx.c
> index 2479da8620ca..c3c0ea219ab7 100644
> --- a/drivers/infiniband/hw/mlx5/devx.c
> +++ b/drivers/infiniband/hw/mlx5/devx.c
> @@ -2669,7 +2669,7 @@ static void devx_wait_async_destroy(struct mlx5_async_cmd *cmd)
>
> void mlx5_ib_ufile_hw_cleanup(struct ib_uverbs_file *ufile)
> {
> - struct mlx5_async_cmd async_cmd[MAX_ASYNC_CMDS];
> + struct mlx5_async_cmd *async_cmd;
Please preserve reverse Christmas tree deceleration.
> struct ib_ucontext *ucontext = ufile->ucontext;
> struct ib_device *device = ucontext->device;
> struct mlx5_ib_dev *dev = to_mdev(device);
> @@ -2678,6 +2678,10 @@ void mlx5_ib_ufile_hw_cleanup(struct ib_uverbs_file *ufile)
> int head = 0;
> int tail = 0;
>
> + async_cmd = kcalloc(MAX_ASYNC_CMDS, sizeof(*async_cmd), GFP_KERNEL);
> + if (WARN_ON(!async_cmd))
> + return;
But honestly I'm not sure I like this, the whole point of this patch was
performance optimization for teardown flow, and this function is called
in a loop not even one time.
So I'm really not sure about how much kcalloc can slow it down here, and
it failing is whole other issue.
I'm thinking out-loud here, but theoretically we know stack size and
this struct size at compile time , so can we should be able to add some
kind of ifdef check "if (stack_frame_size < struct_size)" skip this
function and maybe print some warning.
(since it is purely optimization function and logically the code will
continue correctly without it - but if it needs to be executed then let
it stay like this and needs a big enough stack - which is most of today
systems anyway) ?
> +
> list_for_each_entry(uobject, &ufile->uobjects, list) {
> WARN_ON(uverbs_try_lock_object(uobject, UVERBS_LOOKUP_WRITE));
>
> @@ -2713,6 +2717,8 @@ void mlx5_ib_ufile_hw_cleanup(struct ib_uverbs_file *ufile)
> devx_wait_async_destroy(&async_cmd[head % MAX_ASYNC_CMDS]);
> head++;
> }
> +
> + kfree(async_cmd);
> }
>
> static ssize_t devx_async_cmd_event_read(struct file *filp, char __user *buf,
> --
> 2.39.5
>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup
2025-06-10 9:50 ` Patrisious Haddad
@ 2025-06-10 10:31 ` Arnd Bergmann
2025-06-10 14:51 ` Patrisious Haddad
0 siblings, 1 reply; 6+ messages in thread
From: Arnd Bergmann @ 2025-06-10 10:31 UTC (permalink / raw)
To: Patrisious Haddad, Arnd Bergmann, Leon Romanovsky,
Jason Gunthorpe
Cc: Christian Göttsche, Serge E. Hallyn, Chiara Meiohas,
Alexander Viro, linux-rdma, linux-kernel
On Tue, Jun 10, 2025, at 11:50, Patrisious Haddad wrote:
> On 6/10/2025 12:28 PM, Arnd Bergmann wrote:
>> void mlx5_ib_ufile_hw_cleanup(struct ib_uverbs_file *ufile)
>> {
>> - struct mlx5_async_cmd async_cmd[MAX_ASYNC_CMDS];
>> + struct mlx5_async_cmd *async_cmd;
> Please preserve reverse Christmas tree deceleration.
>> struct ib_ucontext *ucontext = ufile->ucontext;
>> struct ib_device *device = ucontext->device;
>> struct mlx5_ib_dev *dev = to_mdev(device);
>> @@ -2678,6 +2678,10 @@ void mlx5_ib_ufile_hw_cleanup(struct ib_uverbs_file *ufile)
>> int head = 0;
>> int tail = 0;
>>
>> + async_cmd = kcalloc(MAX_ASYNC_CMDS, sizeof(*async_cmd), GFP_KERNEL);
>> + if (WARN_ON(!async_cmd))
>> + return;
>
> But honestly I'm not sure I like this, the whole point of this patch was
> performance optimization for teardown flow, and this function is called
> in a loop not even one time.
>
> So I'm really not sure about how much kcalloc can slow it down here, and
> it failing is whole other issue.
Generally speaking, kcalloc is fairly quick and won't fail here,
but it can take some time under memory pressure if it ends up
in memory reclaim.
> I'm thinking out-loud here, but theoretically we know stack size and
> this struct size at compile time , so can we should be able to add some
> kind of ifdef check "if (stack_frame_size < struct_size)" skip this
> function and maybe print some warning.
> (since it is purely optimization function and logically the code will
> continue correctly without it - but if it needs to be executed then let
> it stay like this and needs a big enough stack - which is most of today
> systems anyway) ?
The thing I'm most interested here is the compile-time warning:
we currently have some configurations that have a very high warning
limit of 2048 bytes or even unlimited, which means that a number
of functions that accidentally use too much stack space (either from
a compiler misoptimization or a programmer error) are missed and
can end up causing problems later. I posted this patch as part of
a larger work to eventually reduce the default warning limit
for those corner cases.
The risk in this particular function to actually overflow is fairly
low since it gets called from sys_close() or __fput(), which
are not nested deeply. I can think of a couple of other ways to
keep your fast path and also build cleanly with a lower warning
limit.
- check which exact configurations actually trigger the high stack
usage and then skip the optimization in those cases. The most
likely causes are CONFIG_KASAN_STACK and CONFIG_KMSAN, both
of which already make the kernel a lot slower.
- reduce MAX_ASYNC_CMDS to always stay under the warning limit, either
picking a lower value unconditionally, or based on the Kconfig
options that trigger it
- preallocate the array as part of an existing structure, whichever
makes sense here (mlx5_ib_dev maybe?).
- reorganize the code in some other form to have the stack not
blow the warning limit. As far as I can tell, I only see this
particular one with clang but not gcc, and that often means
it happens because of some particular inlining decisions that
clang takes, and we can force them by adding strategic
__always_inline or noinline annotations that make both compilers
do the same thing.
Arnd
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup
2025-06-10 10:31 ` Arnd Bergmann
@ 2025-06-10 14:51 ` Patrisious Haddad
0 siblings, 0 replies; 6+ messages in thread
From: Patrisious Haddad @ 2025-06-10 14:51 UTC (permalink / raw)
To: Arnd Bergmann, Arnd Bergmann, Leon Romanovsky, Jason Gunthorpe
Cc: Christian Göttsche, Serge E. Hallyn, Chiara Meiohas,
Alexander Viro, linux-rdma, linux-kernel
On 6/10/2025 1:31 PM, Arnd Bergmann wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, Jun 10, 2025, at 11:50, Patrisious Haddad wrote:
>> On 6/10/2025 12:28 PM, Arnd Bergmann wrote:
>>> void mlx5_ib_ufile_hw_cleanup(struct ib_uverbs_file *ufile)
>>> {
>>> - struct mlx5_async_cmd async_cmd[MAX_ASYNC_CMDS];
>>> + struct mlx5_async_cmd *async_cmd;
>> Please preserve reverse Christmas tree deceleration.
>>> struct ib_ucontext *ucontext = ufile->ucontext;
>>> struct ib_device *device = ucontext->device;
>>> struct mlx5_ib_dev *dev = to_mdev(device);
>>> @@ -2678,6 +2678,10 @@ void mlx5_ib_ufile_hw_cleanup(struct ib_uverbs_file *ufile)
>>> int head = 0;
>>> int tail = 0;
>>>
>>> + async_cmd = kcalloc(MAX_ASYNC_CMDS, sizeof(*async_cmd), GFP_KERNEL);
>>> + if (WARN_ON(!async_cmd))
>>> + return;
>> But honestly I'm not sure I like this, the whole point of this patch was
>> performance optimization for teardown flow, and this function is called
>> in a loop not even one time.
>>
>> So I'm really not sure about how much kcalloc can slow it down here, and
>> it failing is whole other issue.
> Generally speaking, kcalloc is fairly quick and won't fail here,
> but it can take some time under memory pressure if it ends up
> in memory reclaim.
>
>> I'm thinking out-loud here, but theoretically we know stack size and
>> this struct size at compile time , so can we should be able to add some
>> kind of ifdef check "if (stack_frame_size < struct_size)" skip this
>> function and maybe print some warning.
>> (since it is purely optimization function and logically the code will
>> continue correctly without it - but if it needs to be executed then let
>> it stay like this and needs a big enough stack - which is most of today
>> systems anyway) ?
> The thing I'm most interested here is the compile-time warning:
> we currently have some configurations that have a very high warning
> limit of 2048 bytes or even unlimited, which means that a number
> of functions that accidentally use too much stack space (either from
> a compiler misoptimization or a programmer error) are missed and
> can end up causing problems later. I posted this patch as part of
> a larger work to eventually reduce the default warning limit
> for those corner cases.
>
> The risk in this particular function to actually overflow is fairly
> low since it gets called from sys_close() or __fput(), which
> are not nested deeply. I can think of a couple of other ways to
> keep your fast path and also build cleanly with a lower warning
> limit.
>
> - check which exact configurations actually trigger the high stack
> usage and then skip the optimization in those cases. The most
> likely causes are CONFIG_KASAN_STACK and CONFIG_KMSAN, both
> of which already make the kernel a lot slower.
Personally I prefer this option the most.
But If I were you I would wait to hear if the maintainers got a problem
with that approach ...
>
> - reduce MAX_ASYNC_CMDS to always stay under the warning limit, either
> picking a lower value unconditionally, or based on the Kconfig
> options that trigger it
No the number 8 wasn't chosen arbitrarily it also due to performance
reasons, whereas note that it is also the number
of commands that can be sent in parallel for destruction so reducing it
isn't ideal.
>
> - preallocate the array as part of an existing structure, whichever
> makes sense here (mlx5_ib_dev maybe?).
Can work but not ideal.
>
> - reorganize the code in some other form to have the stack not
> blow the warning limit. As far as I can tell, I only see this
> particular one with clang but not gcc, and that often means
> it happens because of some particular inlining decisions that
> clang takes, and we can force them by adding strategic
> __always_inline or noinline annotations that make both compilers
> do the same thing.
Sounds like the hardest option to implement but I have no quarrel with it.
>
> Arnd
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup
2025-06-10 9:28 [PATCH] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup Arnd Bergmann
2025-06-10 9:50 ` Patrisious Haddad
@ 2025-06-12 9:05 ` Leon Romanovsky
2025-06-12 9:05 ` Leon Romanovsky
2 siblings, 0 replies; 6+ messages in thread
From: Leon Romanovsky @ 2025-06-12 9:05 UTC (permalink / raw)
To: Arnd Bergmann, Patrisious Haddad
Cc: Jason Gunthorpe, Arnd Bergmann, Christian Göttsche,
Serge Hallyn, Chiara Meiohas, Al Viro, linux-rdma, linux-kernel
On Tue, Jun 10, 2025 at 11:28:42AM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
>
> This function has an array of eight mlx5_async_cmd structures, which
> often fits on the stack, but depending on the configuration can
> end up blowing the stack frame warning limit:
>
> drivers/infiniband/hw/mlx5/devx.c:2670:6: error: stack frame size (1392) exceeds limit (1280) in 'mlx5_ib_ufile_hw_cleanup' [-Werror,-Wframe-larger-than]
>
> Change this to a dynamic allocation instead. While a kmalloc()
> can theoretically fail, a GFP_KERNEL allocation under a page will
> block until memory has been freed up, so in the worst case, this
> only adds extra time in an already constrained environment.
>
> Fixes: 7c891a4dbcc1 ("RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> drivers/infiniband/hw/mlx5/devx.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
<...>
> + async_cmd = kcalloc(MAX_ASYNC_CMDS, sizeof(*async_cmd), GFP_KERNEL);
> + if (WARN_ON(!async_cmd))
> + return;
Patrisious,
I took this patch for now (without WARN_ON) as it fixes Arnd's issue.
We can provide followup patch if kcalloc() implementation hurts us.
Thanks
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup
2025-06-10 9:28 [PATCH] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup Arnd Bergmann
2025-06-10 9:50 ` Patrisious Haddad
2025-06-12 9:05 ` Leon Romanovsky
@ 2025-06-12 9:05 ` Leon Romanovsky
2 siblings, 0 replies; 6+ messages in thread
From: Leon Romanovsky @ 2025-06-12 9:05 UTC (permalink / raw)
To: Jason Gunthorpe, Patrisious Haddad, Arnd Bergmann
Cc: Arnd Bergmann, Christian Göttsche, Serge Hallyn,
Chiara Meiohas, Al Viro, linux-rdma, linux-kernel
On Tue, 10 Jun 2025 11:28:42 +0200, Arnd Bergmann wrote:
> This function has an array of eight mlx5_async_cmd structures, which
> often fits on the stack, but depending on the configuration can
> end up blowing the stack frame warning limit:
>
> drivers/infiniband/hw/mlx5/devx.c:2670:6: error: stack frame size (1392) exceeds limit (1280) in 'mlx5_ib_ufile_hw_cleanup' [-Werror,-Wframe-larger-than]
>
> Change this to a dynamic allocation instead. While a kmalloc()
> can theoretically fail, a GFP_KERNEL allocation under a page will
> block until memory has been freed up, so in the worst case, this
> only adds extra time in an already constrained environment.
>
> [...]
Applied, thanks!
[1/1] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup
https://git.kernel.org/rdma/rdma/c/b26852daaa83f5
Best regards,
--
Leon Romanovsky <leon@kernel.org>
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2025-06-12 9:17 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-10 9:28 [PATCH] RDMA/mlx5: reduce stack usage in mlx5_ib_ufile_hw_cleanup Arnd Bergmann
2025-06-10 9:50 ` Patrisious Haddad
2025-06-10 10:31 ` Arnd Bergmann
2025-06-10 14:51 ` Patrisious Haddad
2025-06-12 9:05 ` Leon Romanovsky
2025-06-12 9:05 ` Leon Romanovsky
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).