[PATCH net-next] net/mlx5e: Transmit small messages in linear skb

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
@ 2024-12-04 14:02 Alexandra Winter
  2024-12-04 14:16 ` Eric Dumazet
  2024-12-04 14:32 ` Alexander Lobakin
  0 siblings, 2 replies; 21+ messages in thread
From: Alexandra Winter @ 2024-12-04 14:02 UTC (permalink / raw)
  To: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn
  Cc: Nils Hoppmann, netdev, linux-s390, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thorsten Winkler, Simon Horman

Linearize the skb if the device uses IOMMU and the data buffer can fit
into one page. So messages can be transferred in one transfer to the card
instead of two.

Performance issue:
------------------
Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
tcp skbs are always non-linear. Especially on platforms with IOMMU,
mapping and unmapping two pages instead of one per transfer can make a
noticeable difference. On s390 we saw a 13% degradation in throughput,
when running uperf with a request-response pattern with 1k payload and
250 connections parallel. See [0] for a discussion.

This patch mitigates these effects using a work-around in the mlx5 driver.

Notes on implementation:
------------------------
TCP skbs never contain any tailroom, so skb_linearize() will allocate a
new data buffer.
No need to handle rc of skb_linearize(). If it fails, we continue with the
unchanged skb.

As mentioned in the discussion, an alternative, but more invasive approach
would be: premapping a coherent piece of memory in which you can copy
small skbs.

Measurement results:
--------------------
We see an improvement in throughput of up to 16% compared to kernel v6.12.
We measured throughput and CPU consumption of uperf benchmarks with
ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
with and without this patch.

+------------------------------------------+
| Transactions per Second - Deviation in % |
+-------------------+----------------------+
| Workload          |                      |
|  rr1c-1x1--50     |          4.75        |
|  rr1c-1x1-250     |         14.53        |
| rr1c-200x1000--50 |          2.22        |
| rr1c-200x1000-250 |         12.24        |
+-------------------+----------------------+
| Server CPU Consumption - Deviation in %  |
+-------------------+----------------------+
| Workload          |                      |
|  rr1c-1x1--50     |         -1.66        |
|  rr1c-1x1-250     |        -10.00        |
| rr1c-200x1000--50 |         -0.83        |
| rr1c-200x1000-250 |         -8.71        |
+-------------------+----------------------+

Note:
- CPU consumption: less is better
- Client CPU consumption is similar
- Workload:
  rr1c-<bytes send>x<bytes received>-<parallel connections>

  Highly transactional small data sizes (rr1c-1x1)
    This is a Request & Response (RR) test that sends a 1-byte request
    from the client and receives a 1-byte response from the server. This
    is the smallest possible transactional workload test and is smaller
    than most customer workloads. This test represents the RR overhead
    costs.
  Highly transactional medium data sizes (rr1c-200x1000)
    Request & Response (RR) test that sends a 200-byte request from the
    client and receives a 1000-byte response from the server. This test
    should be representative of a typical user's interaction with a remote
    web site.

Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0]
Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Co-developed-by: Nils Hoppmann <niho@linux.ibm.com>
Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..421ba6798ca7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -32,6 +32,7 @@
 
 #include <linux/tcp.h>
 #include <linux/if_vlan.h>
+#include <linux/iommu-dma.h>
 #include <net/geneve.h>
 #include <net/dsfield.h>
 #include "en.h"
@@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
 	struct mlx5e_sq_stats *stats = sq->stats;
 
+	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
+	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
+		skb_linearize(skb);
+
 	if (skb_is_gso(skb)) {
 		int hopbyhop;
 		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-04 14:02 [PATCH net-next] net/mlx5e: Transmit small messages in linear skb Alexandra Winter
@ 2024-12-04 14:16 ` Eric Dumazet
  2024-12-04 14:35   ` Alexandra Winter
  2024-12-04 14:36   ` Eric Dumazet
  2024-12-04 14:32 ` Alexander Lobakin
  1 sibling, 2 replies; 21+ messages in thread
From: Eric Dumazet @ 2024-12-04 14:16 UTC (permalink / raw)
  To: Alexandra Winter
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	Nils Hoppmann, netdev, linux-s390, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thorsten Winkler, Simon Horman

On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@linux.ibm.com> wrote:
>
> Linearize the skb if the device uses IOMMU and the data buffer can fit
> into one page. So messages can be transferred in one transfer to the card
> instead of two.
>
> Performance issue:
> ------------------
> Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
> tcp skbs are always non-linear. Especially on platforms with IOMMU,
> mapping and unmapping two pages instead of one per transfer can make a
> noticeable difference. On s390 we saw a 13% degradation in throughput,
> when running uperf with a request-response pattern with 1k payload and
> 250 connections parallel. See [0] for a discussion.
>
> This patch mitigates these effects using a work-around in the mlx5 driver.
>
> Notes on implementation:
> ------------------------
> TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> new data buffer.
> No need to handle rc of skb_linearize(). If it fails, we continue with the
> unchanged skb.
>
> As mentioned in the discussion, an alternative, but more invasive approach
> would be: premapping a coherent piece of memory in which you can copy
> small skbs.
>
> Measurement results:
> --------------------
> We see an improvement in throughput of up to 16% compared to kernel v6.12.
> We measured throughput and CPU consumption of uperf benchmarks with
> ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
> with and without this patch.
>
> +------------------------------------------+
> | Transactions per Second - Deviation in % |
> +-------------------+----------------------+
> | Workload          |                      |
> |  rr1c-1x1--50     |          4.75        |
> |  rr1c-1x1-250     |         14.53        |
> | rr1c-200x1000--50 |          2.22        |
> | rr1c-200x1000-250 |         12.24        |
> +-------------------+----------------------+
> | Server CPU Consumption - Deviation in %  |
> +-------------------+----------------------+
> | Workload          |                      |
> |  rr1c-1x1--50     |         -1.66        |
> |  rr1c-1x1-250     |        -10.00        |
> | rr1c-200x1000--50 |         -0.83        |
> | rr1c-200x1000-250 |         -8.71        |
> +-------------------+----------------------+
>
> Note:
> - CPU consumption: less is better
> - Client CPU consumption is similar
> - Workload:
>   rr1c-<bytes send>x<bytes received>-<parallel connections>
>
>   Highly transactional small data sizes (rr1c-1x1)
>     This is a Request & Response (RR) test that sends a 1-byte request
>     from the client and receives a 1-byte response from the server. This
>     is the smallest possible transactional workload test and is smaller
>     than most customer workloads. This test represents the RR overhead
>     costs.
>   Highly transactional medium data sizes (rr1c-200x1000)
>     Request & Response (RR) test that sends a 200-byte request from the
>     client and receives a 1000-byte response from the server. This test
>     should be representative of a typical user's interaction with a remote
>     web site.
>
> Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0]
> Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
> Co-developed-by: Nils Hoppmann <niho@linux.ibm.com>
> Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> index f8c7912abe0e..421ba6798ca7 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> @@ -32,6 +32,7 @@
>
>  #include <linux/tcp.h>
>  #include <linux/if_vlan.h>
> +#include <linux/iommu-dma.h>
>  #include <net/geneve.h>
>  #include <net/dsfield.h>
>  #include "en.h"
> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>  {
>         struct mlx5e_sq_stats *stats = sq->stats;
>
> +       /* Don't require 2 IOMMU TLB entries, if one is sufficient */
> +       if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
> +               skb_linearize(skb);
> +
>         if (skb_is_gso(skb)) {
>                 int hopbyhop;
>                 u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
> --
> 2.45.2


Was this tested on x86_64 or any other arch than s390, especially ones
with PAGE_SIZE = 65536 ?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-04 14:16 ` Eric Dumazet
@ 2024-12-04 14:35   ` Alexandra Winter
  2024-12-04 14:36   ` Eric Dumazet
  1 sibling, 0 replies; 21+ messages in thread
From: Alexandra Winter @ 2024-12-04 14:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	Nils Hoppmann, netdev, linux-s390, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thorsten Winkler, Simon Horman



On 04.12.24 15:16, Eric Dumazet wrote:
> On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@linux.ibm.com> wrote:
>>
>> Linearize the skb if the device uses IOMMU and the data buffer can fit
>> into one page. So messages can be transferred in one transfer to the card
>> instead of two.
>>
>> Performance issue:
>> ------------------
>> Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
>> tcp skbs are always non-linear. Especially on platforms with IOMMU,
>> mapping and unmapping two pages instead of one per transfer can make a
>> noticeable difference. On s390 we saw a 13% degradation in throughput,
>> when running uperf with a request-response pattern with 1k payload and
>> 250 connections parallel. See [0] for a discussion.
>>
>> This patch mitigates these effects using a work-around in the mlx5 driver.
>>
>> Notes on implementation:
>> ------------------------
>> TCP skbs never contain any tailroom, so skb_linearize() will allocate a
>> new data buffer.
>> No need to handle rc of skb_linearize(). If it fails, we continue with the
>> unchanged skb.
>>
>> As mentioned in the discussion, an alternative, but more invasive approach
>> would be: premapping a coherent piece of memory in which you can copy
>> small skbs.
>>
>> Measurement results:
>> --------------------
>> We see an improvement in throughput of up to 16% compared to kernel v6.12.
>> We measured throughput and CPU consumption of uperf benchmarks with
>> ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
>> with and without this patch.
>>
>> +------------------------------------------+
>> | Transactions per Second - Deviation in % |
>> +-------------------+----------------------+
>> | Workload          |                      |
>> |  rr1c-1x1--50     |          4.75        |
>> |  rr1c-1x1-250     |         14.53        |
>> | rr1c-200x1000--50 |          2.22        |
>> | rr1c-200x1000-250 |         12.24        |
>> +-------------------+----------------------+
>> | Server CPU Consumption - Deviation in %  |
>> +-------------------+----------------------+
>> | Workload          |                      |
>> |  rr1c-1x1--50     |         -1.66        |
>> |  rr1c-1x1-250     |        -10.00        |
>> | rr1c-200x1000--50 |         -0.83        |
>> | rr1c-200x1000-250 |         -8.71        |
>> +-------------------+----------------------+
>>
>> Note:
>> - CPU consumption: less is better
>> - Client CPU consumption is similar
>> - Workload:
>>   rr1c-<bytes send>x<bytes received>-<parallel connections>
>>
>>   Highly transactional small data sizes (rr1c-1x1)
>>     This is a Request & Response (RR) test that sends a 1-byte request
>>     from the client and receives a 1-byte response from the server. This
>>     is the smallest possible transactional workload test and is smaller
>>     than most customer workloads. This test represents the RR overhead
>>     costs.
>>   Highly transactional medium data sizes (rr1c-200x1000)
>>     Request & Response (RR) test that sends a 200-byte request from the
>>     client and receives a 1000-byte response from the server. This test
>>     should be representative of a typical user's interaction with a remote
>>     web site.
>>
>> Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0]
>> Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
>> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
>> Co-developed-by: Nils Hoppmann <niho@linux.ibm.com>
>> Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> index f8c7912abe0e..421ba6798ca7 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
>> @@ -32,6 +32,7 @@
>>
>>  #include <linux/tcp.h>
>>  #include <linux/if_vlan.h>
>> +#include <linux/iommu-dma.h>
>>  #include <net/geneve.h>
>>  #include <net/dsfield.h>
>>  #include "en.h"
>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>  {
>>         struct mlx5e_sq_stats *stats = sq->stats;
>>
>> +       /* Don't require 2 IOMMU TLB entries, if one is sufficient */
>> +       if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>> +               skb_linearize(skb);
>> +
>>         if (skb_is_gso(skb)) {
>>                 int hopbyhop;
>>                 u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
>> --
>> 2.45.2
> 
> 
> Was this tested on x86_64 or any other arch than s390, especially ones
> with PAGE_SIZE = 65536 ?
> 

No, I don't have a mlx5 card in an x86_64.
Rahul, could you test this patch?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-04 14:16 ` Eric Dumazet
  2024-12-04 14:35   ` Alexandra Winter
@ 2024-12-04 14:36   ` Eric Dumazet
  2024-12-06 14:47     ` David Laight
  2024-12-06 15:25     ` Alexandra Winter
  1 sibling, 2 replies; 21+ messages in thread
From: Eric Dumazet @ 2024-12-04 14:36 UTC (permalink / raw)
  To: Alexandra Winter
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	Nils Hoppmann, netdev, linux-s390, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thorsten Winkler, Simon Horman

On Wed, Dec 4, 2024 at 3:16 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Wed, Dec 4, 2024 at 3:02 PM Alexandra Winter <wintera@linux.ibm.com> wrote:
> >
> > Linearize the skb if the device uses IOMMU and the data buffer can fit
> > into one page. So messages can be transferred in one transfer to the card
> > instead of two.
> >
> > Performance issue:
> > ------------------
> > Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
> > tcp skbs are always non-linear. Especially on platforms with IOMMU,
> > mapping and unmapping two pages instead of one per transfer can make a
> > noticeable difference. On s390 we saw a 13% degradation in throughput,
> > when running uperf with a request-response pattern with 1k payload and
> > 250 connections parallel. See [0] for a discussion.
> >
> > This patch mitigates these effects using a work-around in the mlx5 driver.
> >
> > Notes on implementation:
> > ------------------------
> > TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> > new data buffer.
> > No need to handle rc of skb_linearize(). If it fails, we continue with the
> > unchanged skb.
> >
> > As mentioned in the discussion, an alternative, but more invasive approach
> > would be: premapping a coherent piece of memory in which you can copy
> > small skbs.
> >
> > Measurement results:
> > --------------------
> > We see an improvement in throughput of up to 16% compared to kernel v6.12.
> > We measured throughput and CPU consumption of uperf benchmarks with
> > ConnectX-6 cards on s390 architecture and compared results of kernel v6.12
> > with and without this patch.
> >
> > +------------------------------------------+
> > | Transactions per Second - Deviation in % |
> > +-------------------+----------------------+
> > | Workload          |                      |
> > |  rr1c-1x1--50     |          4.75        |
> > |  rr1c-1x1-250     |         14.53        |
> > | rr1c-200x1000--50 |          2.22        |
> > | rr1c-200x1000-250 |         12.24        |
> > +-------------------+----------------------+
> > | Server CPU Consumption - Deviation in %  |
> > +-------------------+----------------------+
> > | Workload          |                      |
> > |  rr1c-1x1--50     |         -1.66        |
> > |  rr1c-1x1-250     |        -10.00        |
> > | rr1c-200x1000--50 |         -0.83        |
> > | rr1c-200x1000-250 |         -8.71        |
> > +-------------------+----------------------+
> >
> > Note:
> > - CPU consumption: less is better
> > - Client CPU consumption is similar
> > - Workload:
> >   rr1c-<bytes send>x<bytes received>-<parallel connections>
> >
> >   Highly transactional small data sizes (rr1c-1x1)
> >     This is a Request & Response (RR) test that sends a 1-byte request
> >     from the client and receives a 1-byte response from the server. This
> >     is the smallest possible transactional workload test and is smaller
> >     than most customer workloads. This test represents the RR overhead
> >     costs.
> >   Highly transactional medium data sizes (rr1c-200x1000)
> >     Request & Response (RR) test that sends a 200-byte request from the
> >     client and receives a 1000-byte response from the server. This test
> >     should be representative of a typical user's interaction with a remote
> >     web site.
> >
> > Link: https://lore.kernel.org/netdev/20220907122505.26953-1-wintera@linux.ibm.com/#t [0]
> > Suggested-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> > Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
> > Co-developed-by: Nils Hoppmann <niho@linux.ibm.com>
> > Signed-off-by: Nils Hoppmann <niho@linux.ibm.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > index f8c7912abe0e..421ba6798ca7 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> > @@ -32,6 +32,7 @@
> >
> >  #include <linux/tcp.h>
> >  #include <linux/if_vlan.h>
> > +#include <linux/iommu-dma.h>
> >  #include <net/geneve.h>
> >  #include <net/dsfield.h>
> >  #include "en.h"
> > @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> >  {
> >         struct mlx5e_sq_stats *stats = sq->stats;
> >
> > +       /* Don't require 2 IOMMU TLB entries, if one is sufficient */
> > +       if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
> > +               skb_linearize(skb);
> > +
> >         if (skb_is_gso(skb)) {
> >                 int hopbyhop;
> >                 u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
> > --
> > 2.45.2
>
>
> Was this tested on x86_64 or any other arch than s390, especially ones
> with PAGE_SIZE = 65536 ?

I would suggest the opposite : copy the headers (typically less than
128 bytes) on a piece of coherent memory.

As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.

include/net/tso.h and net/core/tso.c users do this.

Sure, patch is going to be more invasive, but all arches will win.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-04 14:36   ` Eric Dumazet
@ 2024-12-06 14:47     ` David Laight
  2024-12-06 16:35       ` Eric Dumazet
  2024-12-06 15:25     ` Alexandra Winter
  1 sibling, 1 reply; 21+ messages in thread
From: David Laight @ 2024-12-06 14:47 UTC (permalink / raw)
  To: 'Eric Dumazet', Alexandra Winter
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	Nils Hoppmann, netdev@vger.kernel.org, linux-s390@vger.kernel.org,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Thorsten Winkler,
	Simon Horman

From: Eric Dumazet
> Sent: 04 December 2024 14:36
...
> I would suggest the opposite : copy the headers (typically less than
> 128 bytes) on a piece of coherent memory.

A long time ago a colleague tested the cutoff between copying to
a fixed buffer and dma access to the kernel memory buffer for
a sparc mbus/sbus system (which has an iommu).
While entirely different in all regards the cutoff was just over 1k.

The ethernet drivers I wrote did a data copy to/from a pre-mapped
area for both transmit and receive.
I suspect the simplicity of that also improved things.

These days you'd definitely want to map tso buffers.
But the 'copybreak' size for receive could be quite high.

On x86 just make sure the destination address for 'rep movsb'
is 64 byte aligned - it will double the copy speed.
The source alignment doesn't matter at all.
(AMD chips might be different, but an aligned copy of a whole
number of 'words' can always be done.)

I've also wondered whether the ethernet driver could 'hold' the
iommu page table entries after (eg) a receive frame is processed
and then drop the PA of the replacement buffer into the same slot.
That is likely to speed up iommu setup.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-06 14:47     ` David Laight
@ 2024-12-06 16:35       ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2024-12-06 16:35 UTC (permalink / raw)
  To: David Laight
  Cc: Alexandra Winter, Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan,
	Leon Romanovsky, David Miller, Jakub Kicinski, Paolo Abeni,
	Andrew Lunn, Nils Hoppmann, netdev@vger.kernel.org,
	linux-s390@vger.kernel.org, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thorsten Winkler, Simon Horman

On Fri, Dec 6, 2024 at 3:48 PM David Laight <David.Laight@aculab.com> wrote:
>
> I've also wondered whether the ethernet driver could 'hold' the
> iommu page table entries after (eg) a receive frame is processed
> and then drop the PA of the replacement buffer into the same slot.
> That is likely to speed up iommu setup.

This has been done a long time ago (Alexander Duyck page refcount
trick in Intel drivers)

Then put into generic page_pool

https://static.lwn.net/kerneldoc/networking/page_pool.html

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-04 14:36   ` Eric Dumazet
  2024-12-06 14:47     ` David Laight
@ 2024-12-06 15:25     ` Alexandra Winter
  2024-12-10 11:49       ` Dragos Tatulea
  1 sibling, 1 reply; 21+ messages in thread
From: Alexandra Winter @ 2024-12-06 15:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	Nils Hoppmann, netdev, linux-s390, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thorsten Winkler, Simon Horman



On 04.12.24 15:36, Eric Dumazet wrote:
> I would suggest the opposite : copy the headers (typically less than
> 128 bytes) on a piece of coherent memory.
> 
> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.
> 
> include/net/tso.h and net/core/tso.c users do this.
> 
> Sure, patch is going to be more invasive, but all arches will win.


Thank you very much for the examples, I think I understand what you are proposing.
I am not sure whether I'm able to map it to the mlx5 driver, but I could
try to come up with a RFC. It may take some time though.

NVidia people, any suggesttions? Do you want to handle that yourselves?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-06 15:25     ` Alexandra Winter
@ 2024-12-10 11:49       ` Dragos Tatulea
  2024-12-11 16:19         ` Alexandra Winter
  0 siblings, 1 reply; 21+ messages in thread
From: Dragos Tatulea @ 2024-12-10 11:49 UTC (permalink / raw)
  To: Alexandra Winter, Eric Dumazet
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	Nils Hoppmann, netdev, linux-s390, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thorsten Winkler, Simon Horman



On 06.12.24 16:25, Alexandra Winter wrote:
> 
> 
> On 04.12.24 15:36, Eric Dumazet wrote:
>> I would suggest the opposite : copy the headers (typically less than
>> 128 bytes) on a piece of coherent memory.
>>
>> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.
>>
>> include/net/tso.h and net/core/tso.c users do this.
>>
>> Sure, patch is going to be more invasive, but all arches will win.
> 
> 
> Thank you very much for the examples, I think I understand what you are proposing.
> I am not sure whether I'm able to map it to the mlx5 driver, but I could
> try to come up with a RFC. It may take some time though.
> 
> NVidia people, any suggesttions? Do you want to handle that yourselves?
> 
Discussed with Saeed and he proposed another approach that is better for
us: copy the whole skb payload inline into the WQE if it's size is below a
threshold. This threshold can be configured through the
tx-copybreak mechanism.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-10 11:49       ` Dragos Tatulea
@ 2024-12-11 16:19         ` Alexandra Winter
  2024-12-11 17:36           ` Dragos Tatulea
  0 siblings, 1 reply; 21+ messages in thread
From: Alexandra Winter @ 2024-12-11 16:19 UTC (permalink / raw)
  To: Dragos Tatulea, Eric Dumazet
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	Nils Hoppmann, netdev, linux-s390, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thorsten Winkler, Simon Horman, Niklas Schnelle

On 10.12.24 12:49, Dragos Tatulea wrote:
> 
> 
> On 06.12.24 16:25, Alexandra Winter wrote:
>>
>>
>> On 04.12.24 15:36, Eric Dumazet wrote:
>>> I would suggest the opposite : copy the headers (typically less than
>>> 128 bytes) on a piece of coherent memory.
>>>
>>> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.
>>>
>>> include/net/tso.h and net/core/tso.c users do this.
>>>
>>> Sure, patch is going to be more invasive, but all arches will win.
>>
>>
>> Thank you very much for the examples, I think I understand what you are proposing.
>> I am not sure whether I'm able to map it to the mlx5 driver, but I could
>> try to come up with a RFC. It may take some time though.
>>
>> NVidia people, any suggesttions? Do you want to handle that yourselves?
>>
> Discussed with Saeed and he proposed another approach that is better for
> us: copy the whole skb payload inline into the WQE if it's size is below a
> threshold. This threshold can be configured through the
> tx-copybreak mechanism.
> 
> Thanks,
> Dragos

Thank you very much Dargos and Saeed.
I am not sure I understand the details of "inline into the WQE". 
The idea seems to be to use a premapped coherent array per WQ 
that is indexed by queue element index and can be used to copy headers and
maybe small messages into.
I think I see something similar to your proposal in mlx4 (?).
To me the general concept seems to be similar to what Eric is proposing.
Did I get it right?

I really like the idea to use tx-copybreak for threshold configuration.

As Eric mentioned that is not a very small patch and maybe not fit for backporting
to older distro versions.
What do you think of a two-step approach as described in the other sub-thread?
A simple patch for mitigation that can be backported, and then the improvement
as a replacement?

Thanks,
Alexandra

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-11 16:19         ` Alexandra Winter
@ 2024-12-11 17:36           ` Dragos Tatulea
  0 siblings, 0 replies; 21+ messages in thread
From: Dragos Tatulea @ 2024-12-11 17:36 UTC (permalink / raw)
  To: Alexandra Winter, Eric Dumazet
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	Nils Hoppmann, netdev, linux-s390, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thorsten Winkler, Simon Horman, Niklas Schnelle

Hi Alexandra,

On 11.12.24 17:19, Alexandra Winter wrote:
> 
> 
> On 10.12.24 12:49, Dragos Tatulea wrote:
>>
>>
>> On 06.12.24 16:25, Alexandra Winter wrote:
>>>
>>>
>>> On 04.12.24 15:36, Eric Dumazet wrote:
>>>> I would suggest the opposite : copy the headers (typically less than
>>>> 128 bytes) on a piece of coherent memory.
>>>>
>>>> As a bonus, if skb->len is smaller than 256 bytes, copy the whole skb.
>>>>
>>>> include/net/tso.h and net/core/tso.c users do this.
>>>>
>>>> Sure, patch is going to be more invasive, but all arches will win.
>>>
>>>
>>> Thank you very much for the examples, I think I understand what you are proposing.
>>> I am not sure whether I'm able to map it to the mlx5 driver, but I could
>>> try to come up with a RFC. It may take some time though.
>>>
>>> NVidia people, any suggesttions? Do you want to handle that yourselves?
>>>
>> Discussed with Saeed and he proposed another approach that is better for
>> us: copy the whole skb payload inline into the WQE if it's size is below a
>> threshold. This threshold can be configured through the
>> tx-copybreak mechanism.
>>
>> Thanks,
>> Dragos
> 
> 
> Thank you very much Dargos and Saeed.
> I am not sure I understand the details of "inline into the WQE". 
> The idea seems to be to use a premapped coherent array per WQ 
> that is indexed by queue element index and can be used to copy headers and
> maybe small messages into.
> I think I see something similar to your proposal in mlx4 (?).
I think so, yes.

> To me the general concept seems to be similar to what Eric is proposing.
> Did I get it right?
>
AFAIU, it's not quite the same thing. With Eric's proposal we'd use an
additional premapped buffer, copy data there and sumbit WQEs with
pointers to this buffer. The inline proposal is to copy the data inline
in the WQE directly without the need of an additional buffer.
To understand if this is feasible, I still need to find out what is
actual max inline space is.

> I really like the idea to use tx-copybreak for threshold configuration.
> 
That's good to know. 
> As Eric mentioned that is not a very small patch and maybe not fit for backporting
> to older distro versions.
> What do you think of a two-step approach as described in the other sub-thread?
> A simple patch for mitigation that can be backported, and then the improvement
> as a replacement?
As stated in the previous mail, let's see what Tariq has to say about
doing some arch specific fix. I am not very optimistic though...

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-04 14:02 [PATCH net-next] net/mlx5e: Transmit small messages in linear skb Alexandra Winter
  2024-12-04 14:16 ` Eric Dumazet
@ 2024-12-04 14:32 ` Alexander Lobakin
  2024-12-06 15:20   ` Alexandra Winter
  1 sibling, 1 reply; 21+ messages in thread
From: Alexander Lobakin @ 2024-12-04 14:32 UTC (permalink / raw)
  To: Alexandra Winter
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Nils Hoppmann, netdev, linux-s390, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thorsten Winkler, Simon Horman

From: Alexandra Winter <wintera@linux.ibm.com>
Date: Wed,  4 Dec 2024 15:02:30 +0100

> Linearize the skb if the device uses IOMMU and the data buffer can fit
> into one page. So messages can be transferred in one transfer to the card
> instead of two.

I'd expect this to be on the generic level, not copied over the drivers?
Not sure about PAGE_SIZE, but I never saw a NIC/driver/platform where
copying let's say 256 bytes would be slower than 2x dma_map (even with
direct DMA).

> 
> Performance issue:
> ------------------
> Since commit 472c2e07eef0 ("tcp: add one skb cache for tx")
> tcp skbs are always non-linear. Especially on platforms with IOMMU,
> mapping and unmapping two pages instead of one per transfer can make a
> noticeable difference. On s390 we saw a 13% degradation in throughput,
> when running uperf with a request-response pattern with 1k payload and
> 250 connections parallel. See [0] for a discussion.
> 
> This patch mitigates these effects using a work-around in the mlx5 driver.
> 
> Notes on implementation:
> ------------------------
> TCP skbs never contain any tailroom, so skb_linearize() will allocate a
> new data buffer.
> No need to handle rc of skb_linearize(). If it fails, we continue with the
> unchanged skb.
> 
> As mentioned in the discussion, an alternative, but more invasive approach
> would be: premapping a coherent piece of memory in which you can copy
> small skbs.

Yes, that one would be better.

[...]

> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>  {
>  	struct mlx5e_sq_stats *stats = sq->stats;
>  
> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)

1. What's with the direct DMA? I believe it would benefit, too?
2. Why truesize, not something like

	if (skb->len <= some_sane_value_maybe_1k)

3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
   it's a good idea to rely on this.
   Some test-based hardcode would be enough (i.e. threshold on which
   DMA mapping starts performing better).

> +		skb_linearize(skb);
> +
>  	if (skb_is_gso(skb)) {

BTW can't there be a case when the skb is GSO, but its truesize is
PAGE_SIZE and linearize will be way too slow (not sure it's possible,
just guessing)?

>  		int hopbyhop;
>  		u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);

Thanks,
Olek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-04 14:32 ` Alexander Lobakin
@ 2024-12-06 15:20   ` Alexandra Winter
  2024-12-09 11:36     ` Tariq Toukan
  2024-12-10 11:44     ` Dragos Tatulea
  0 siblings, 2 replies; 21+ messages in thread
From: Alexandra Winter @ 2024-12-06 15:20 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Nils Hoppmann, netdev, linux-s390, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thorsten Winkler, Simon Horman



On 04.12.24 15:32, Alexander Lobakin wrote:
>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>  {
>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>  
>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
   +		skb_linearize(skb);
> 1. What's with the direct DMA? I believe it would benefit, too?


Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
Any opinions from the NVidia people?


> 2. Why truesize, not something like
> 
> 	if (skb->len <= some_sane_value_maybe_1k)


With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
When we set the threshhold at a smaller value, skb->len makes more sense


> 
> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>    it's a good idea to rely on this.
>    Some test-based hardcode would be enough (i.e. threshold on which
>    DMA mapping starts performing better).


A threshhold of 4k is absolutely fine with us (s390). 
A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.


NVidia people do you have any opinion on a good threshhold?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-06 15:20   ` Alexandra Winter
@ 2024-12-09 11:36     ` Tariq Toukan
  2024-12-10 11:44     ` Dragos Tatulea
  1 sibling, 0 replies; 21+ messages in thread
From: Tariq Toukan @ 2024-12-09 11:36 UTC (permalink / raw)
  To: Alexandra Winter, Alexander Lobakin
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Nils Hoppmann, netdev, linux-s390, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thorsten Winkler, Simon Horman



On 06/12/2024 17:20, Alexandra Winter wrote:
> 
> 
> On 04.12.24 15:32, Alexander Lobakin wrote:
>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>   {
>>>   	struct mlx5e_sq_stats *stats = sq->stats;
>>>   
>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>     +		skb_linearize(skb);
>> 1. What's with the direct DMA? I believe it would benefit, too?
> 
> 
> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
> Any opinions from the NVidia people?
> 
> 
>> 2. Why truesize, not something like
>>
>> 	if (skb->len <= some_sane_value_maybe_1k)
> 
> 
> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
> When we set the threshhold at a smaller value, skb->len makes more sense
> 
> 
>>
>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>     it's a good idea to rely on this.
>>     Some test-based hardcode would be enough (i.e. threshold on which
>>     DMA mapping starts performing better).
> 
> 
> A threshhold of 4k is absolutely fine with us (s390).
> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
> 
> 
> NVidia people do you have any opinion on a good threshhold?
> 

Hi,

Many approaches in the past few years are going the opposite direction, 
trying to avoid copies ("zero-copy").

In many cases, copy up to PAGE_SIZE means copy everything.
For high NIC speeds this is not realistic.

Anyway, based on past experience, threshold should not exceed "max 
header size" (128/256b).



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-06 15:20   ` Alexandra Winter
  2024-12-09 11:36     ` Tariq Toukan
@ 2024-12-10 11:44     ` Dragos Tatulea
  2024-12-10 13:54       ` Alexander Lobakin
  1 sibling, 1 reply; 21+ messages in thread
From: Dragos Tatulea @ 2024-12-10 11:44 UTC (permalink / raw)
  To: Alexandra Winter, Alexander Lobakin
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Nils Hoppmann, netdev, linux-s390, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thorsten Winkler, Simon Horman



On 06.12.24 16:20, Alexandra Winter wrote:
> 
> 
> On 04.12.24 15:32, Alexander Lobakin wrote:
>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>  {
>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>  
>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>    +		skb_linearize(skb);
>> 1. What's with the direct DMA? I believe it would benefit, too?
> 
> 
> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
> Any opinions from the NVidia people?
> 
Agreed.

> 
>> 2. Why truesize, not something like
>>
>> 	if (skb->len <= some_sane_value_maybe_1k)
> 
> 
> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
> When we set the threshhold at a smaller value, skb->len makes more sense
> 
> 
>>
>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>    it's a good idea to rely on this.
>>    Some test-based hardcode would be enough (i.e. threshold on which
>>    DMA mapping starts performing better).
> 
> 
> A threshhold of 4k is absolutely fine with us (s390). 
> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
>
> 
> NVidia people do you have any opinion on a good threshhold?
> 
1KB is still to large. As Tariq mentioned, the threshold should not
exceed 128/256B. I am currently testing this with 256B on x86. So far no
regressions but I need to play with it more.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-10 11:44     ` Dragos Tatulea
@ 2024-12-10 13:54       ` Alexander Lobakin
  2024-12-10 17:10         ` Joe Damato
  2024-12-11 13:35         ` Alexandra Winter
  0 siblings, 2 replies; 21+ messages in thread
From: Alexander Lobakin @ 2024-12-10 13:54 UTC (permalink / raw)
  To: Dragos Tatulea
  Cc: Alexandra Winter, Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan,
	Leon Romanovsky, David Miller, Jakub Kicinski, Paolo Abeni,
	Eric Dumazet, Andrew Lunn, Nils Hoppmann, netdev, linux-s390,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Thorsten Winkler,
	Simon Horman

From: Dragos Tatulea <dtatulea@nvidia.com>
Date: Tue, 10 Dec 2024 12:44:04 +0100

> 
> 
> On 06.12.24 16:20, Alexandra Winter wrote:
>>
>>
>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>  {
>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>  
>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>    +		skb_linearize(skb);
>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>
>>
>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>> Any opinions from the NVidia people?
>>
> Agreed.
> 
>>
>>> 2. Why truesize, not something like
>>>
>>> 	if (skb->len <= some_sane_value_maybe_1k)
>>
>>
>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>> When we set the threshhold at a smaller value, skb->len makes more sense
>>
>>
>>>
>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>>    it's a good idea to rely on this.
>>>    Some test-based hardcode would be enough (i.e. threshold on which
>>>    DMA mapping starts performing better).
>>
>>
>> A threshhold of 4k is absolutely fine with us (s390). 
>> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
>>
>>
>> NVidia people do you have any opinion on a good threshhold?
>>
> 1KB is still to large. As Tariq mentioned, the threshold should not
> exceed 128/256B. I am currently testing this with 256B on x86. So far no
> regressions but I need to play with it more.

On different setups, usually the copybreak of 192 or 256 bytes was the
most efficient as well.

> 
> Thanks,
> Dragos

Thanks,
Olek

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-10 13:54       ` Alexander Lobakin
@ 2024-12-10 17:10         ` Joe Damato
  2024-12-11 13:35         ` Alexandra Winter
  1 sibling, 0 replies; 21+ messages in thread
From: Joe Damato @ 2024-12-10 17:10 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Dragos Tatulea, Alexandra Winter, Rahul Rameshbabu,
	Saeed Mahameed, Tariq Toukan, Leon Romanovsky, David Miller,
	Jakub Kicinski, Paolo Abeni, Eric Dumazet, Andrew Lunn,
	Nils Hoppmann, netdev, linux-s390, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Thorsten Winkler, Simon Horman

On Tue, Dec 10, 2024 at 02:54:26PM +0100, Alexander Lobakin wrote:
> From: Dragos Tatulea <dtatulea@nvidia.com>
> Date: Tue, 10 Dec 2024 12:44:04 +0100
> 
> > 
> > 
> > On 06.12.24 16:20, Alexandra Winter wrote:
> >>
> >>
> >> On 04.12.24 15:32, Alexander Lobakin wrote:
> >>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> >>>>  {
> >>>>  	struct mlx5e_sq_stats *stats = sq->stats;
> >>>>  
> >>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
> >>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
> >>    +		skb_linearize(skb);
> >>> 1. What's with the direct DMA? I believe it would benefit, too?
> >>
> >>
> >> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
> >> Any opinions from the NVidia people?
> >>
> > Agreed.
> > 
> >>
> >>> 2. Why truesize, not something like
> >>>
> >>> 	if (skb->len <= some_sane_value_maybe_1k)
> >>
> >>
> >> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
> >> When we set the threshhold at a smaller value, skb->len makes more sense
> >>
> >>
> >>>
> >>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
> >>>    it's a good idea to rely on this.
> >>>    Some test-based hardcode would be enough (i.e. threshold on which
> >>>    DMA mapping starts performing better).
> >>
> >>
> >> A threshhold of 4k is absolutely fine with us (s390). 
> >> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
> >>
> >>
> >> NVidia people do you have any opinion on a good threshhold?
> >>
> > 1KB is still to large. As Tariq mentioned, the threshold should not
> > exceed 128/256B. I am currently testing this with 256B on x86. So far no
> > regressions but I need to play with it more.
> 
> On different setups, usually the copybreak of 192 or 256 bytes was the
> most efficient as well.

A minor suggestion:

Would it be at all possible for the people who've run these
experiments to document their findings somewhere: what the different
test setups were, what the copybreak settings were, what the
results were, and how they were measured?

Some drivers have a few details documented in
Documentation/networking/device_drivers/ethernet/, but if others
could do this too, like mlx5, in detail so findings could be
reproduced by others, that would be amazing.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-10 13:54       ` Alexander Lobakin
  2024-12-10 17:10         ` Joe Damato
@ 2024-12-11 13:35         ` Alexandra Winter
  2024-12-11 17:28           ` Dragos Tatulea
  1 sibling, 1 reply; 21+ messages in thread
From: Alexandra Winter @ 2024-12-11 13:35 UTC (permalink / raw)
  To: Alexander Lobakin, Dragos Tatulea
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Nils Hoppmann, netdev, linux-s390, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thorsten Winkler, Simon Horman, Niklas Schnelle

On 10.12.24 14:54, Alexander Lobakin wrote:
> From: Dragos Tatulea <dtatulea@nvidia.com>
> Date: Tue, 10 Dec 2024 12:44:04 +0100
> 
>>
>>
>> On 06.12.24 16:20, Alexandra Winter wrote:
>>>
>>>
>>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>>  {
>>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>>  
>>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>>    +		skb_linearize(skb);
>>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>>
>>>
>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>>> Any opinions from the NVidia people?
>>>
>> Agreed.
>>
>>>
>>>> 2. Why truesize, not something like
>>>>
>>>> 	if (skb->len <= some_sane_value_maybe_1k)
>>>
>>>
>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>>> When we set the threshhold at a smaller value, skb->len makes more sense
>>>
>>>
>>>>
>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>>>    it's a good idea to rely on this.
>>>>    Some test-based hardcode would be enough (i.e. threshold on which
>>>>    DMA mapping starts performing better).
>>>
>>>
>>> A threshhold of 4k is absolutely fine with us (s390). 
>>> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
>>>
>>>
>>> NVidia people do you have any opinion on a good threshhold?
>>>

On 09.12.24 12:36, Tariq Toukan wrote:
> Hi,
> 
> Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy").
> 
> In many cases, copy up to PAGE_SIZE means copy everything.
> For high NIC speeds this is not realistic.
> 
> Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).

>> 1KB is still to large. As Tariq mentioned, the threshold should not
>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>> regressions but I need to play with it more.
> 
> On different setups, usually the copybreak of 192 or 256 bytes was the
> most efficient as well.
> 
>>

Thank you very much, everybody for looking into this.

Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390
and with this patch we get up to 12% more transactions/sec even for 1k messages.

As we're always using an IOMMU and are operating with large multi socket systems,
DMA costs far outweigh the costs of small to medium memcpy()s on our system.
We realize that the recommendation is to just run without IOMMU when performance is important,
but this is not an option in our environment.

I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a
strategic direction, it is more a simple mitigation of our problem.

Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements.
We could even restrict it to S390 arch, if you want.

A threshold of 256b would cover some amount of our typical request-response workloads
(think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k).
Could we maybe define an architecture dependent threshold?

My preferred scenario for the next steps would be the following:
1) It would be great if we could get a simple mitigation patch upstream, that the distros could
easily backport. This would avoid our customers experiencing performance regression when they
upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example)

2) Then we could work on a more invasive solution. (see proposals by Eric, Dragos/Saeed).
This would then replace the simple mitigation patch upstream, and future releases would contain 
that solution. If somebody else wants to work on this improved version, fine with me, otherwise
I could give it a try.

What do you think of this approach?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-11 13:35         ` Alexandra Winter
@ 2024-12-11 17:28           ` Dragos Tatulea
  2024-12-11 17:50             ` Niklas Schnelle
  2024-12-12 10:36             ` Christian Borntraeger
  0 siblings, 2 replies; 21+ messages in thread
From: Dragos Tatulea @ 2024-12-11 17:28 UTC (permalink / raw)
  To: Alexandra Winter, Alexander Lobakin
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Nils Hoppmann, netdev, linux-s390, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thorsten Winkler, Simon Horman, Niklas Schnelle

[-- Attachment #1: Type: text/plain, Size: 7060 bytes --]

Hi Alexandra,

On 11.12.24 14:35, Alexandra Winter wrote:
> 
> 
> On 10.12.24 14:54, Alexander Lobakin wrote:
>> From: Dragos Tatulea <dtatulea@nvidia.com>
>> Date: Tue, 10 Dec 2024 12:44:04 +0100
>>
>>>
>>>
>>> On 06.12.24 16:20, Alexandra Winter wrote:
>>>>
>>>>
>>>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>>>  {
>>>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>>>  
>>>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>>>    +		skb_linearize(skb);
>>>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>>>
>>>>
>>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>>>> Any opinions from the NVidia people?
>>>>
>>> Agreed.
>>>
>>>>
>>>>> 2. Why truesize, not something like
>>>>>
>>>>> 	if (skb->len <= some_sane_value_maybe_1k)
>>>>
>>>>
>>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>>>> When we set the threshhold at a smaller value, skb->len makes more sense
>>>>
>>>>
>>>>>
>>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>>>>    it's a good idea to rely on this.
>>>>>    Some test-based hardcode would be enough (i.e. threshold on which
>>>>>    DMA mapping starts performing better).
>>>>
>>>>
>>>> A threshhold of 4k is absolutely fine with us (s390). 
>>>> A threshhold of 1k would definitvely improve our situation and bring back the performance for some important scenarios.
>>>>
>>>>
>>>> NVidia people do you have any opinion on a good threshhold?
>>>>
> 
> On 09.12.24 12:36, Tariq Toukan wrote:
>> Hi,
>>
>> Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy").
>>
>> In many cases, copy up to PAGE_SIZE means copy everything.
>> For high NIC speeds this is not realistic.
>>
>> Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).
> 
>>> 1KB is still to large. As Tariq mentioned, the threshold should not
>>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>>> regressions but I need to play with it more.
I checked on a x86 system with CX7 and we seem to get ~4% degradation
when using this approach. The patch was modified a bit according to
previous discussions (diff at end of mail).

Here's how I tested:
- uperf client side has many queues.
- uperf server side has single queue with interrupts pinned to a single
  CPU. This is to better isolate CPU behaviour. The idea is to have the
  CPU on the server side saturated or close to saturation.
- Used the uperf 1B read x 1B write scenario with 50 and 100 threads
  (profile attached).
  Both the on-cpu and off-cpu cases were checked.
- Code change was done only on server side.

The results:
```
Case:                                          Throughput   Operations
----------------------------------------------------------------------
app cpu == irq cpu, nthreads= 25, baseline     3.86Mb/s     30036552  
app cpu == irq cpu, nthreads= 25, skb_linear   3.52Mb/s     26969315  

app cpu == irq cpu, nthreads= 50, baseline     4.26Mb/s     33122458 
app cpu == irq cpu, nthreads= 50, skb_linear   4.06Mb/s     31606290 

app cpu == irq cpu, nthreads=100, baseline     4.08Mb/s     31775434  
app cpu == irq cpu, nthreads=100, skb_linear   3.86Mb/s     30105582

app cpu != irq cpu, nthreads= 25, baseline     3.57Mb/s     27785488
app cpu != irq cpu, nthreads= 25, skb_linear   3.56Mb/s     27730133

app cpu != irq cpu, nthreads= 50, baseline     3.97Mb/s     30876264
app cpu != irq cpu, nthreads= 50, skb_linear   3.82Mb/s     29737654

app cpu != irq cpu, nthreads=100, baseline     4.06Mb/s     31621140
app cpu != irq cpu, nthreads=100, skb_linear   3.90Mb/s     30364382
```

So not encouraging. I restricted the tests to 1B payloads only as I
expected worse perf for larger payloads.

>>
>> On different setups, usually the copybreak of 192 or 256 bytes was the
>> most efficient as well.
>>
>>>
> 
> 
> Thank you very much, everybody for looking into this.
> 
> Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390
> and with this patch we get up to 12% more transactions/sec even for 1k messages.
> 
> As we're always using an IOMMU and are operating with large multi socket systems,
> DMA costs far outweigh the costs of small to medium memcpy()s on our system.
> We realize that the recommendation is to just run without IOMMU when performance is important,
> but this is not an option in our environment.
> 
> I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a
> strategic direction, it is more a simple mitigation of our problem.
> 
> Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements.
> We could even restrict it to S390 arch, if you want.
>
Maybe Tariq can comment on this.

> A threshold of 256b would cover some amount of our typical request-response workloads
> (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k).
> Could we maybe define an architecture dependent threshold?
> 
> My preferred scenario for the next steps would be the following:
> 1) It would be great if we could get a simple mitigation patch upstream, that the distros could
> easily backport. This would avoid our customers experiencing performance regression when they
> upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example)
>
Stupid question on my behalf: why can't this patch be taken as a distro
patch for s390 and carried over over releases? This way the kernel
upgrade pain would be avoided.

> 2) Then we could work on a more invasive solution. (see proposals by Eric, Dragos/Saeed).
> This would then replace the simple mitigation patch upstream, and future releases would contain 
> that solution. If somebody else wants to work on this improved version, fine with me, otherwise
> I could give it a try.
> 
For the inline WQE solution I don't think we have a large amout of space
to copy so much data into. For Eric's side buffer suggestion it will be
even more invasive and it will touch many more code paths...

> What do you think of this approach?
> 
>
Sounds tricky. Let's see what Tariq has to say.

Thanks,
Dragos

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..cc947daa538b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -269,6 +269,9 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
        struct mlx5e_sq_stats *stats = sq->stats;
 
+       if (skb->len < 256)
+               skb_linearize(skb);
+
        if (skb_is_gso(skb)) {
                int hopbyhop;
                u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);

[-- Attachment #2: rr1c-1x1.xml --]
[-- Type: text/xml, Size: 522 bytes --]

<?xml version="1.0"?>
<profile name="netperf">
  <group nthreads="$nthreads">
        <transaction iterations="$iter">
            <flowop type="accept" options="remotehost=$h1 protocol=$proto tcp_nodelay"/>
        </transaction>
        <transaction duration="$dur">
            <flowop type="read" options="size=1"/>
            <flowop type="write" options="size=1"/>
        </transaction>
        <transaction iterations="$iter">
            <flowop type="disconnect" />
        </transaction>
  </group>
</profile>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-11 17:28           ` Dragos Tatulea
@ 2024-12-11 17:50             ` Niklas Schnelle
  2024-12-13 20:41               ` Dragos Tatulea
  2024-12-12 10:36             ` Christian Borntraeger
  1 sibling, 1 reply; 21+ messages in thread
From: Niklas Schnelle @ 2024-12-11 17:50 UTC (permalink / raw)
  To: Dragos Tatulea, Alexandra Winter, Alexander Lobakin
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Nils Hoppmann, netdev, linux-s390, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thorsten Winkler, Simon Horman, Jason Gunthorpe

On Wed, 2024-12-11 at 18:28 +0100, Dragos Tatulea wrote:
> > > > > 

---8<---

> 
> > On 09.12.24 12:36, Tariq Toukan wrote:
> > > Hi,
> > > 
> > > Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy").
> > > 
> > > In many cases, copy up to PAGE_SIZE means copy everything.
> > > For high NIC speeds this is not realistic.
> > > 
> > > Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).
> > 
> > > > 1KB is still to large. As Tariq mentioned, the threshold should not
> > > > exceed 128/256B. I am currently testing this with 256B on x86. So far no
> > > > regressions but I need to play with it more.
> I checked on a x86 system with CX7 and we seem to get ~4% degradation
> when using this approach. The patch was modified a bit according to
> previous discussions (diff at end of mail).
> 
> Here's how I tested:
> - uperf client side has many queues.
> - uperf server side has single queue with interrupts pinned to a single
>   CPU. This is to better isolate CPU behaviour. The idea is to have the
>   CPU on the server side saturated or close to saturation.
> - Used the uperf 1B read x 1B write scenario with 50 and 100 threads
>   (profile attached).
>   Both the on-cpu and off-cpu cases were checked.
> - Code change was done only on server side.

I'm assuming this is with the x86 default IOMMU pass-through mode?
Would you be able and willing to try with iommu.passthrough=0
and amd_iommu=on respectively intel_iommu=on? Check
/sys/bus/pci/devices/<dev>/iommu_group/type for "DMA-FQ" to make sure
the dma-iommu code is used. This is obviously not a "tuned for all out
perf at any cost" configuration but it is recommended in hardening
guides and I believe some ARM64 systems also default to using the IOMMU
for bare metal DMA API use. So it's not an unexpected configuration
either.

Thanks,
Niklas



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-11 17:50             ` Niklas Schnelle
@ 2024-12-13 20:41               ` Dragos Tatulea
  0 siblings, 0 replies; 21+ messages in thread
From: Dragos Tatulea @ 2024-12-13 20:41 UTC (permalink / raw)
  To: Niklas Schnelle, Alexandra Winter, Alexander Lobakin
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Nils Hoppmann, netdev, linux-s390, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Thorsten Winkler, Simon Horman, Jason Gunthorpe



On 11.12.24 18:50, Niklas Schnelle wrote:
> On Wed, 2024-12-11 at 18:28 +0100, Dragos Tatulea wrote:
>>>>>>
> 
> ---8<---
> 
>>
>>> On 09.12.24 12:36, Tariq Toukan wrote:
>>>> Hi,
>>>>
>>>> Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy").
>>>>
>>>> In many cases, copy up to PAGE_SIZE means copy everything.
>>>> For high NIC speeds this is not realistic.
>>>>
>>>> Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).
>>>
>>>>> 1KB is still to large. As Tariq mentioned, the threshold should not
>>>>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>>>>> regressions but I need to play with it more.
>> I checked on a x86 system with CX7 and we seem to get ~4% degradation
>> when using this approach. The patch was modified a bit according to
>> previous discussions (diff at end of mail).
>>
>> Here's how I tested:
>> - uperf client side has many queues.
>> - uperf server side has single queue with interrupts pinned to a single
>>   CPU. This is to better isolate CPU behaviour. The idea is to have the
>>   CPU on the server side saturated or close to saturation.
>> - Used the uperf 1B read x 1B write scenario with 50 and 100 threads
>>   (profile attached).
>>   Both the on-cpu and off-cpu cases were checked.
>> - Code change was done only on server side.
> 
> I'm assuming this is with the x86 default IOMMU pass-through mode?
It was in a VM with PCI passthrough for the device.

> Would you be able and willing to try with iommu.passthrough=0
> and amd_iommu=on respectively intel_iommu=on? Check
> /sys/bus/pci/devices/<dev>/iommu_group/type for "DMA-FQ" to make sure
> the dma-iommu code is used. This is obviously not a "tuned for all out
> perf at any cost" configuration but it is recommended in hardening
> guides and I believe some ARM64 systems also default to using the IOMMU
> for bare metal DMA API use. So it's not an unexpected configuration
> either.
> 
I got hold of a bare metal system where I could turn iommu passthrough
off and confirm iommu_group/type as being DMA_FQ. But the results are
inconclusive due to instability. I will look into this again after the
holidays.

Thanks,
Dragos

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
  2024-12-11 17:28           ` Dragos Tatulea
  2024-12-11 17:50             ` Niklas Schnelle
@ 2024-12-12 10:36             ` Christian Borntraeger
  1 sibling, 0 replies; 21+ messages in thread
From: Christian Borntraeger @ 2024-12-12 10:36 UTC (permalink / raw)
  To: Dragos Tatulea, Alexandra Winter, Alexander Lobakin
  Cc: Rahul Rameshbabu, Saeed Mahameed, Tariq Toukan, Leon Romanovsky,
	David Miller, Jakub Kicinski, Paolo Abeni, Eric Dumazet,
	Andrew Lunn, Nils Hoppmann, netdev, linux-s390, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Sven Schnelle, Thorsten Winkler,
	Simon Horman, Niklas Schnelle



Am 11.12.24 um 18:28 schrieb Dragos Tatulea:

>> My preferred scenario for the next steps would be the following:
>> 1) It would be great if we could get a simple mitigation patch upstream, that the distros could
>> easily backport. This would avoid our customers experiencing performance regression when they
>> upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example)
>>
> Stupid question on my behalf: why can't this patch be taken as a distro
> patch for s390 and carried over over releases? This way the kernel
> upgrade pain would be avoided.

This is not how distros work today. All code/patches must come from upstream
and all code/patches must be cross-architecture. There are exceptions, but
only for very critical things.

Furthermore, the right answer to avoid upgrade pain is to fix things upstream.
So lets try to find a solution that we can integrate.

Christian


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-12-13 20:41 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-04 14:02 [PATCH net-next] net/mlx5e: Transmit small messages in linear skb Alexandra Winter
2024-12-04 14:16 ` Eric Dumazet
2024-12-04 14:35   ` Alexandra Winter
2024-12-04 14:36   ` Eric Dumazet
2024-12-06 14:47     ` David Laight
2024-12-06 16:35       ` Eric Dumazet
2024-12-06 15:25     ` Alexandra Winter
2024-12-10 11:49       ` Dragos Tatulea
2024-12-11 16:19         ` Alexandra Winter
2024-12-11 17:36           ` Dragos Tatulea
2024-12-04 14:32 ` Alexander Lobakin
2024-12-06 15:20   ` Alexandra Winter
2024-12-09 11:36     ` Tariq Toukan
2024-12-10 11:44     ` Dragos Tatulea
2024-12-10 13:54       ` Alexander Lobakin
2024-12-10 17:10         ` Joe Damato
2024-12-11 13:35         ` Alexandra Winter
2024-12-11 17:28           ` Dragos Tatulea
2024-12-11 17:50             ` Niklas Schnelle
2024-12-13 20:41               ` Dragos Tatulea
2024-12-12 10:36             ` Christian Borntraeger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).