* [PATCH 0/1] RDMA/mlx5: Release CPU for other processes in mlx5_free_cmd_msg()
@ 2024-05-22  3:32 Anand Khoje
  2024-05-22  3:32 ` [PATCH 1/1] " Anand Khoje
  0 siblings, 1 reply; 7+ messages in thread

From: Anand Khoje @ 2024-05-22  3:32 UTC (permalink / raw)
To: linux-rdma, linux-kernel
Cc: anand.a.khoje, rama.nichanamatlu, manjunath.b.patil

FW pages are allocated and reclaimed through a worker, pages_handler().
This worker allocates mlx5 mailbox messages to populate the meta-data
associated with the pages being allocated or reclaimed.

During reclaim, after the meta-data of the pages to reclaim has been
fetched and the pages released, the dma pool associated with each mailbox
message is freed using dma_pool_free(), which tries to find the dma_page
of the dma_pool by walking its page_list. This approach is slow, and if
the number of pages reclaimed is high, a single work item takes a long
time to execute. As a result, other critical processes are starved of
CPU.

With this patch, if the time spent in mlx5_free_cmd_msg() exceeds 2 msec,
the worker yields the CPU to other processes. In our tests, we allocated
around 3.4 million FW pages and then deallocated all of them at once;
this caused the worker thread to yield the CPU many times:

May 21 04:39:28 kernel: mlx5_core 0000:17:00.0: mlx5_free_cmd_msg:1352:(pid 327407): Spent more than 2 msecs, yielding CPU
May 21 04:39:28 kernel: mlx5_core 0000:17:00.0: mlx5_free_cmd_msg:1352:(pid 327407): Spent more than 2 msecs, yielding CPU
May 21 04:39:29 kernel: mlx5_core 0000:17:00.0: mlx5_free_cmd_msg:1352:(pid 327407): Spent more than 2 msecs, yielding CPU
May 21 04:39:29 kernel: mlx5_core 0000:17:00.0: mlx5_free_cmd_msg:1352:(pid 327407): Spent more than 2 msecs, yielding CPU

Anand Khoje (1):
  RDMA/mlx5: Release CPU for other processes in mlx5_free_cmd_msg()

 drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 7 +++++++
 1 file changed, 7 insertions(+)

--
1.8.3.1

^ permalink raw reply	[flat|nested] 7+ messages in thread
* [PATCH 1/1] RDMA/mlx5: Release CPU for other processes in mlx5_free_cmd_msg()
  2024-05-22  3:32 [PATCH 0/1] RDMA/mlx5: Release CPU for other processes in mlx5_free_cmd_msg() Anand Khoje
@ 2024-05-22  3:32 ` Anand Khoje
  2024-05-26 15:23   ` Shay Drori
  2024-05-30 17:14   ` Leon Romanovsky
  0 siblings, 2 replies; 7+ messages in thread

From: Anand Khoje @ 2024-05-22  3:32 UTC (permalink / raw)
To: linux-rdma, linux-kernel
Cc: anand.a.khoje, rama.nichanamatlu, manjunath.b.patil

In a non-FLR context, the CX-5 at times requests release of ~8 million
device pages. This requires a huge number of cmd mailboxes, which have
to be released once the pages are reclaimed. Releasing that many cmd
mailboxes consumes CPU time running into many seconds; with
non-preemptible kernels this leads to critical processes starving on
that CPU's run queue. To alleviate this, this patch relinquishes the
CPU periodically but conditionally.

Orabug: 36275016

Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index 9c21bce..9fbf25d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -1336,16 +1336,23 @@ static struct mlx5_cmd_msg *mlx5_alloc_cmd_msg(struct mlx5_core_dev *dev,
 	return ERR_PTR(err);
 }
 
+#define RESCHED_MSEC 2
 static void mlx5_free_cmd_msg(struct mlx5_core_dev *dev,
 			      struct mlx5_cmd_msg *msg)
 {
 	struct mlx5_cmd_mailbox *head = msg->next;
 	struct mlx5_cmd_mailbox *next;
+	unsigned long start_time = jiffies;
 
 	while (head) {
 		next = head->next;
 		free_cmd_box(dev, head);
 		head = next;
+		if (time_after(jiffies, start_time + msecs_to_jiffies(RESCHED_MSEC))) {
+			mlx5_core_warn_rl(dev, "Spent more than %d msecs, yielding CPU\n", RESCHED_MSEC);
+			cond_resched();
+			start_time = jiffies;
+		}
 	}
 	kfree(msg);
 }
--
1.8.3.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: [PATCH 1/1] RDMA/mlx5: Release CPU for other processes in mlx5_free_cmd_msg()
  2024-05-22  3:32 ` [PATCH 1/1] " Anand Khoje
@ 2024-05-26 15:23   ` Shay Drori
  2024-05-29 12:01     ` Anand Khoje
  0 siblings, 1 reply; 7+ messages in thread

From: Shay Drori @ 2024-05-26 15:23 UTC (permalink / raw)
To: Anand Khoje, linux-rdma, linux-kernel, moshe
Cc: rama.nichanamatlu, manjunath.b.patil, netdev@vger.kernel.org

Hi Anand.

First, the correct mailing list for this patch is netdev@vger.kernel.org;
please send the next version there.

On 22/05/2024 6:32, Anand Khoje wrote:
> In non FLR context, at times CX-5 requests release of ~8 million device pages.
> This needs humongous number of cmd mailboxes, which to be released once
> the pages are reclaimed. Release of humongous number of cmd mailboxes
> consuming cpu time running into many secs, with non preemptable kernels
> is leading to critical process starving on that cpu's RQ. To alleviate
> this, this patch relinquishes cpu periodically but conditionally.
>
> Orabug: 36275016

this doesn't seem relevant

>
> Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> index 9c21bce..9fbf25d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> @@ -1336,16 +1336,23 @@ static struct mlx5_cmd_msg *mlx5_alloc_cmd_msg(struct mlx5_core_dev *dev,
>  	return ERR_PTR(err);
>  }
>
> +#define RESCHED_MSEC 2

What if you add cond_resched() on every iteration of the loop? Does it
take much more time to finish 8 million pages, or the same?
If it does matter, maybe 2 ms is too high a frequency? 20 ms? 200 ms?

Thanks

>  static void mlx5_free_cmd_msg(struct mlx5_core_dev *dev,
>  			      struct mlx5_cmd_msg *msg)
>  {
>  	struct mlx5_cmd_mailbox *head = msg->next;
>  	struct mlx5_cmd_mailbox *next;
> +	unsigned long start_time = jiffies;
>
>  	while (head) {
>  		next = head->next;
>  		free_cmd_box(dev, head);
>  		head = next;
> +		if (time_after(jiffies, start_time + msecs_to_jiffies(RESCHED_MSEC))) {
> +			mlx5_core_warn_rl(dev, "Spent more than %d msecs, yielding CPU\n", RESCHED_MSEC);
> +			cond_resched();
> +			start_time = jiffies;
> +		}
>  	}
>  	kfree(msg);
>  }

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 1/1] RDMA/mlx5: Release CPU for other processes in mlx5_free_cmd_msg()
  2024-05-26 15:23   ` Shay Drori
@ 2024-05-29 12:01     ` Anand Khoje
  0 siblings, 0 replies; 7+ messages in thread

From: Anand Khoje @ 2024-05-29 12:01 UTC (permalink / raw)
To: Shay Drori, linux-rdma, linux-kernel, moshe
Cc: rama.nichanamatlu, manjunath.b.patil, netdev@vger.kernel.org

On 5/26/24 20:53, Shay Drori wrote:
> Hi Anand.
>
> First, the correct Mailing list for this patch is
> netdev@vger.kernel.org, please send there the next version.
>
> On 22/05/2024 6:32, Anand Khoje wrote:
>> In non FLR context, at times CX-5 requests release of ~8 million
>> device pages.
>> This needs humongous number of cmd mailboxes, which to be released once
>> the pages are reclaimed. Release of humongous number of cmd mailboxes
>> consuming cpu time running into many secs, with non preemptable kernels
>> is leading to critical process starving on that cpu's RQ. To alleviate
>> this, this patch relinquishes cpu periodically but conditionally.
>>
>> Orabug: 36275016
>
> this doesn't seem relevant
>
>>
>> Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
>> ---
>>   drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
>> b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
>> index 9c21bce..9fbf25d 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
>> @@ -1336,16 +1336,23 @@ static struct mlx5_cmd_msg
>> *mlx5_alloc_cmd_msg(struct mlx5_core_dev *dev,
>>       return ERR_PTR(err);
>>   }
>> +#define RESCHED_MSEC 2
>
>
> What if you add cond_resched() on every iteration of the loop ? Does it
> take much more time to finish 8 Million pages or same ?
> If it does matter, maybe 2 ms is too high freq ? 20 ms ? 200 ms ?
>

Shay,

There is no hard rule we can apply here, only guidance and suggestions.
If the delay is too short, relinquishing too often leads to thrashing
and high context-switch costs; if it is too long, relinquishing too
infrequently leads to run-queue starvation. Based on our applications
and workload, 2 msecs was chosen as a middle ground. But your
suggestions are also very viable, so we are reconsidering the value.
This was very helpful, thank you! I will resend a v2 after more testing.

Thanks,
Anand

> Thanks
>
>>  static void mlx5_free_cmd_msg(struct mlx5_core_dev *dev,
>>                     struct mlx5_cmd_msg *msg)
>>  {
>>      struct mlx5_cmd_mailbox *head = msg->next;
>>      struct mlx5_cmd_mailbox *next;
>> +    unsigned long start_time = jiffies;
>>      while (head) {
>>          next = head->next;
>>          free_cmd_box(dev, head);
>>          head = next;
>> +        if (time_after(jiffies, start_time +
>> msecs_to_jiffies(RESCHED_MSEC))) {
>> +            mlx5_core_warn_rl(dev, "Spent more than %d msecs,
>> yielding CPU\n", RESCHED_MSEC);
>> +            cond_resched();
>> +            start_time = jiffies;
>> +        }
>>      }
>>      kfree(msg);
>>  }

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 1/1] RDMA/mlx5: Release CPU for other processes in mlx5_free_cmd_msg()
  2024-05-22  3:32 ` [PATCH 1/1] " Anand Khoje
  2024-05-26 15:23   ` Shay Drori
@ 2024-05-30 17:14   ` Leon Romanovsky
  2024-05-31  4:51     ` Anand Khoje
  1 sibling, 1 reply; 7+ messages in thread

From: Leon Romanovsky @ 2024-05-30 17:14 UTC (permalink / raw)
To: Anand Khoje
Cc: linux-rdma, linux-kernel, rama.nichanamatlu, manjunath.b.patil

On Wed, May 22, 2024 at 09:02:56AM +0530, Anand Khoje wrote:
> In non FLR context, at times CX-5 requests release of ~8 million device pages.
> This needs humongous number of cmd mailboxes, which to be released once
> the pages are reclaimed. Release of humongous number of cmd mailboxes
> consuming cpu time running into many secs, with non preemptable kernels
> is leading to critical process starving on that cpu's RQ. To alleviate
> this, this patch relinquishes cpu periodically but conditionally.
>
> Orabug: 36275016
>
> Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> index 9c21bce..9fbf25d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> @@ -1336,16 +1336,23 @@ static struct mlx5_cmd_msg *mlx5_alloc_cmd_msg(struct mlx5_core_dev *dev,
>  	return ERR_PTR(err);
>  }
>
> +#define RESCHED_MSEC 2
>  static void mlx5_free_cmd_msg(struct mlx5_core_dev *dev,
>  			      struct mlx5_cmd_msg *msg)
>  {
>  	struct mlx5_cmd_mailbox *head = msg->next;
>  	struct mlx5_cmd_mailbox *next;
> +	unsigned long start_time = jiffies;
>
>  	while (head) {
>  		next = head->next;
>  		free_cmd_box(dev, head);

Did you consider making this function asynchronous and parallel?

Thanks

>  		head = next;
> +		if (time_after(jiffies, start_time + msecs_to_jiffies(RESCHED_MSEC))) {
> +			mlx5_core_warn_rl(dev, "Spent more than %d msecs, yielding CPU\n", RESCHED_MSEC);
> +			cond_resched();
> +			start_time = jiffies;
> +		}
>  	}
>  	kfree(msg);
>  }
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 1/1] RDMA/mlx5: Release CPU for other processes in mlx5_free_cmd_msg()
  2024-05-30 17:14   ` Leon Romanovsky
@ 2024-05-31  4:51     ` Anand Khoje
  2024-05-31 10:00       ` Leon Romanovsky
  0 siblings, 1 reply; 7+ messages in thread

From: Anand Khoje @ 2024-05-31  4:51 UTC (permalink / raw)
To: Leon Romanovsky
Cc: linux-rdma, linux-kernel, rama.nichanamatlu, manjunath.b.patil, netdev@vger.kernel.org

On 5/30/24 22:44, Leon Romanovsky wrote:
> On Wed, May 22, 2024 at 09:02:56AM +0530, Anand Khoje wrote:
>> In non FLR context, at times CX-5 requests release of ~8 million device pages.
>> This needs humongous number of cmd mailboxes, which to be released once
>> the pages are reclaimed. Release of humongous number of cmd mailboxes
>> consuming cpu time running into many secs, with non preemptable kernels
>> is leading to critical process starving on that cpu's RQ. To alleviate
>> this, this patch relinquishes cpu periodically but conditionally.
>>
>> Orabug: 36275016
>>
>> Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 7 +++++++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
>> index 9c21bce..9fbf25d 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
>> @@ -1336,16 +1336,23 @@ static struct mlx5_cmd_msg *mlx5_alloc_cmd_msg(struct mlx5_core_dev *dev,
>>  	return ERR_PTR(err);
>>  }
>>
>> +#define RESCHED_MSEC 2
>>  static void mlx5_free_cmd_msg(struct mlx5_core_dev *dev,
>>  			      struct mlx5_cmd_msg *msg)
>>  {
>>  	struct mlx5_cmd_mailbox *head = msg->next;
>>  	struct mlx5_cmd_mailbox *next;
>> +	unsigned long start_time = jiffies;
>>
>>  	while (head) {
>>  		next = head->next;
>>  		free_cmd_box(dev, head);
> Did you consider to make this function asynchronous and parallel?
>
> Thanks

Hi Leon,

Thanks for reviewing this patch.

Here, all page-related methods (give_pages/reclaim_pages/release_all_pages)
are executed in a worker thread through pages_work_handler(). Doesn't
that mean it is already asynchronous?

When the worker thread, in this case processing reclaim_pages(), takes a
long time, it starves other processes on the CPU it is running on.
Oracle UEK being a non-preemptible kernel, these starved processes do
not get the CPU until the worker relinquishes it. This applies even to
processes that are time-critical and high priority; when starved of CPU
for a long time, they trigger a kernel panic.

Hence, this patch implements a time-based relinquishing of the CPU using
cond_resched().

Shay Drori suggested tuning the interval (which we have made 2 msec) to
reduce overly frequent context switching and find a balance in
processing these mailbox objects. I am presently running some tests
based on this suggestion.

Thanks,
Anand

>>  		head = next;
>> +		if (time_after(jiffies, start_time + msecs_to_jiffies(RESCHED_MSEC))) {
>> +			mlx5_core_warn_rl(dev, "Spent more than %d msecs, yielding CPU\n", RESCHED_MSEC);
>> +			cond_resched();
>> +			start_time = jiffies;
>> +		}
>>  	}
>>  	kfree(msg);
>>  }
>> --
>> 1.8.3.1
>>

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH 1/1] RDMA/mlx5: Release CPU for other processes in mlx5_free_cmd_msg()
  2024-05-31  4:51     ` Anand Khoje
@ 2024-05-31 10:00       ` Leon Romanovsky
  0 siblings, 0 replies; 7+ messages in thread

From: Leon Romanovsky @ 2024-05-31 10:00 UTC (permalink / raw)
To: Anand Khoje
Cc: linux-rdma, linux-kernel, rama.nichanamatlu, manjunath.b.patil, netdev@vger.kernel.org

On Fri, May 31, 2024 at 10:21:39AM +0530, Anand Khoje wrote:
>
> On 5/30/24 22:44, Leon Romanovsky wrote:
> > On Wed, May 22, 2024 at 09:02:56AM +0530, Anand Khoje wrote:
> > > In non FLR context, at times CX-5 requests release of ~8 million device pages.
> > > This needs humongous number of cmd mailboxes, which to be released once
> > > the pages are reclaimed. Release of humongous number of cmd mailboxes
> > > consuming cpu time running into many secs, with non preemptable kernels
> > > is leading to critical process starving on that cpu's RQ. To alleviate
> > > this, this patch relinquishes cpu periodically but conditionally.
> > >
> > > Orabug: 36275016
> > >
> > > Signed-off-by: Anand Khoje <anand.a.khoje@oracle.com>
> > > ---
> > >  drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 7 +++++++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> > > index 9c21bce..9fbf25d 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
> > > @@ -1336,16 +1336,23 @@ static struct mlx5_cmd_msg *mlx5_alloc_cmd_msg(struct mlx5_core_dev *dev,
> > >  	return ERR_PTR(err);
> > >  }
> > >
> > > +#define RESCHED_MSEC 2
> > >  static void mlx5_free_cmd_msg(struct mlx5_core_dev *dev,
> > >  			      struct mlx5_cmd_msg *msg)
> > >  {
> > >  	struct mlx5_cmd_mailbox *head = msg->next;
> > >  	struct mlx5_cmd_mailbox *next;
> > > +	unsigned long start_time = jiffies;
> > >
> > >  	while (head) {
> > >  		next = head->next;
> > >  		free_cmd_box(dev, head);
> > Did you consider to make this function asynchronous and parallel?
> >
> > Thanks
>
> Hi Leon,
>
> Thanks for reviewing this patch.
>
> Here, all page related methods give_pages/reclaim_pages/release_all_pages
> are executed in a worker thread through pages_work_handler().
>
> Doesn't that mean it is already asynchronous?

You didn't provide any performance data, so I can't say if it is related
to the work handlers. For example, we can be in this loop when we call
mlx5_cmd_disable(), and that will cause synchronous calls to
dma_pool_free(), which holds the spinlock. Also, pages_work_handler()
runs on a single-threaded workqueue; it is not asynchronous.

> When the worker thread, in this case it is processing reclaim_pages(), is
> taking a long time - it is starving other processes on the processor that it
> is running on. Oracle UEK being a non-preemptible kernel, these other
> processes that are getting starved do not get CPU until the worker
> relinquishes the CPU. This applies to even processes that are time critical
> and high priority. These processes when starved of CPU for a long time,
> trigger a kernel panic.

Please add the kernel panic and perf data to your commit message.

> Hence, this patch implements a time based relinquish of CPU using
> cond_resched().
>
> Shay Dori, had a suggestion to tune the time (which we have made 2 msec), to
> reduce too frequent context switching and find a balance in processing of
> these mailbox objects. I am presently running some tests on the basis of
> this suggestion.

You will have better results if you parallelize the page release.

Thanks

> Thanks,
>
> Anand
>
> > > 		head = next;
> > > +		if (time_after(jiffies, start_time + msecs_to_jiffies(RESCHED_MSEC))) {
> > > +			mlx5_core_warn_rl(dev, "Spent more than %d msecs, yielding CPU\n", RESCHED_MSEC);
> > > +			cond_resched();
> > > +			start_time = jiffies;
> > > +		}
> > >  	}
> > >  	kfree(msg);
> > >  }
> > > --
> > > 1.8.3.1
> > >

^ permalink raw reply	[flat|nested] 7+ messages in thread
end of thread, other threads: [~2024-05-31 10:00 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2024-05-22  3:32 [PATCH 0/1] RDMA/mlx5: Release CPU for other processes in mlx5_free_cmd_msg() Anand Khoje
2024-05-22  3:32 ` [PATCH 1/1] " Anand Khoje
2024-05-26 15:23   ` Shay Drori
2024-05-29 12:01     ` Anand Khoje
2024-05-30 17:14   ` Leon Romanovsky
2024-05-31  4:51     ` Anand Khoje
2024-05-31 10:00       ` Leon Romanovsky