* [RFC net] tcp: Fix performance regression for request-response workloads
@ 2022-09-07 12:25 Alexandra Winter
2022-09-07 16:06 ` Eric Dumazet
0 siblings, 1 reply; 8+ messages in thread
From: Alexandra Winter @ 2022-09-07 12:25 UTC (permalink / raw)
To: Eric Dumazet, David Miller, Jakub Kicinski
Cc: Yuchung Cheng, Soheil Hassas Yeganeh, Willem de Bruijn,
Paolo Abeni, Mat Martineau, Saeed Mahameed, Niklas Schnelle,
Christian Borntraeger, netdev, linux-s390, Heiko Carstens,
Alexandra Winter
Since linear payload was removed even for single small messages,
an additional page is required, and we are measuring a performance impact.
3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
explicitly allowed "payload in skb->head for first skb put in the queue,
to not impact RPC workloads."
472c2e07eef0 ("tcp: add one skb cache for tx")
made that obsolete and removed it.
When
d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
reverted it, this piece was not added back.
When running uperf with a request-response pattern with 1k payload
and 250 parallel connections, we measure a 13% drop in throughput
for our PCI-based network interfaces since 472c2e07eef0.
(Our IOMMU is sensitive to the number of mapped pages.)
Could you please consider allowing linear payload for the first
skb in queue again? A patch proposal is appended below.
Kind regards
Alexandra
---------------------------------------------------------------
tcp: allow linear skb payload for first in queue
Allow payload in skb->head for the first skb in the queue;
RPC workloads will benefit.
Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx")
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
---
net/ipv4/tcp.c | 39 +++++++++++++++++++++++++++++++++++++--
1 file changed, 37 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e5011c136fdb..f7cbccd41d85 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1154,6 +1154,30 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
}
EXPORT_SYMBOL(tcp_sendpage);
+/* Do not bother using a page frag for very small frames.
+ * But use this heuristic only for the first skb in write queue.
+ *
+ * Having no payload in skb->head allows better SACK shifting
+ * in tcp_shift_skb_data(), reducing sack/rack overhead, because
+ * write queue has less skbs.
+ * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
+ * This also speeds up tso_fragment(), since it won't fallback
+ * to tcp_fragment().
+ */
+static int linear_payload_sz(bool first_skb)
+{
+ if (first_skb)
+ return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
+ return 0;
+}
+
+static int select_size(bool first_skb, bool zc)
+{
+ if (zc)
+ return 0;
+ return linear_payload_sz(first_skb);
+}
+
void tcp_free_fastopen_req(struct tcp_sock *tp)
{
if (tp->fastopen_req) {
@@ -1311,6 +1335,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
bool first_skb;
+ int linear;
new_segment:
if (!sk_stream_memory_free(sk))
@@ -1322,7 +1347,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
goto restart;
}
first_skb = tcp_rtx_and_write_queues_empty(sk);
- skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
+ linear = select_size(first_skb, zc);
+ skb = tcp_stream_alloc_skb(sk, linear,
+ sk->sk_allocation,
first_skb);
if (!skb)
goto wait_for_space;
@@ -1344,7 +1371,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
if (copy > msg_data_left(msg))
copy = msg_data_left(msg);
- if (!zc) {
+ /* Where to copy to? */
+ if (skb_availroom(skb) > 0 && !zc) {
+ /* We have some space in skb head. Superb! */
+ copy = min_t(int, copy, skb_availroom(skb));
+ err = skb_add_data_nocache(sk, skb, &msg->msg_iter,
+ copy);
+ if (err)
+ goto do_error;
+ } else if (!zc) {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
struct page_frag *pfrag = sk_page_frag(sk);
--
2.24.3 (Apple Git-128)
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [RFC net] tcp: Fix performance regression for request-response workloads
2022-09-07 12:25 [RFC net] tcp: Fix performance regression for request-response workloads Alexandra Winter
@ 2022-09-07 16:06 ` Eric Dumazet
2022-09-08 9:40 ` Christian Borntraeger
0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2022-09-07 16:06 UTC (permalink / raw)
To: Alexandra Winter
Cc: David Miller, Jakub Kicinski, Yuchung Cheng,
Soheil Hassas Yeganeh, Willem de Bruijn, Paolo Abeni,
Mat Martineau, Saeed Mahameed, Niklas Schnelle,
Christian Borntraeger, netdev, linux-s390, Heiko Carstens
On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
>
> Since linear payload was removed even for single small messages,
> an additional page is required, and we are measuring a performance impact.
>
> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
> explicitly allowed "payload in skb->head for first skb put in the queue,
> to not impact RPC workloads."
> 472c2e07eef0 ("tcp: add one skb cache for tx")
> made that obsolete and removed it.
> When
> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
> reverted it, this piece was not added back.
>
> When running uperf with a request-response pattern with 1k payload
> and 250 parallel connections, we measure a 13% drop in throughput
> for our PCI-based network interfaces since 472c2e07eef0.
> (Our IOMMU is sensitive to the number of mapped pages.)
>
> Could you please consider allowing linear payload for the first
> skb in queue again? A patch proposal is appended below.
No.
Please add a workaround in your driver.
You can increase throughput by 20% by premapping a coherent piece of
memory into which you can copy small skbs (skb->head included).
Something like 256 bytes per slot in the TX ring.
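A rough sketch of that idea, as a hedged illustration: my_ring, INLINE_SZ
and the slot layout are hypothetical, not a real driver API; only
dma_alloc_coherent(), skb_copy_bits() and skb_shinfo() are real kernel
interfaces.

#include <linux/dma-mapping.h>
#include <linux/skbuff.h>

#define INLINE_SZ 256	/* premapped bytes per TX ring slot */

struct my_ring {
	void *inline_buf;	/* CPU address of the premapped zone */
	dma_addr_t inline_dma;	/* bus address of the premapped zone */
	u32 size;		/* number of slots */
};

/* At ring setup: one coherent DMA allocation covers all slots, so small
 * frames never need a per-packet dma_map/unmap (and thus no IOMMU work).
 */
static int my_ring_alloc_inline(struct my_ring *ring, struct device *dev)
{
	ring->inline_buf = dma_alloc_coherent(dev, ring->size * INLINE_SZ,
					      &ring->inline_dma, GFP_KERNEL);
	return ring->inline_buf ? 0 : -ENOMEM;
}

/* In ndo_start_xmit(): copy a small skb (skb->head included) into the
 * slot's premapped area; the skb can then be consumed right away.
 * skb_copy_bits() handles both linear and frag'd layouts.
 */
static bool my_try_inline_tx(struct my_ring *ring, u32 slot,
			     struct sk_buff *skb, dma_addr_t *dma)
{
	if (skb->len > INLINE_SZ)
		return false;

	skb_copy_bits(skb, 0, ring->inline_buf + slot * INLINE_SZ, skb->len);
	*dma = ring->inline_dma + slot * INLINE_SZ;
	return true;
}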
>
> Kind regards
> Alexandra
>
> ---------------------------------------------------------------
>
> tcp: allow linear skb payload for first in queue
>
> Allow payload in skb->head for the first skb in the queue;
> RPC workloads will benefit.
>
> Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx")
> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
> ---
> net/ipv4/tcp.c | 39 +++++++++++++++++++++++++++++++++++++--
> 1 file changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e5011c136fdb..f7cbccd41d85 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -1154,6 +1154,30 @@ int tcp_sendpage(struct sock *sk, struct page *page, int offset,
> }
> EXPORT_SYMBOL(tcp_sendpage);
>
> +/* Do not bother using a page frag for very small frames.
> + * But use this heuristic only for the first skb in write queue.
> + *
> + * Having no payload in skb->head allows better SACK shifting
> + * in tcp_shift_skb_data(), reducing sack/rack overhead, because
> + * write queue has less skbs.
> + * Each skb can hold up to MAX_SKB_FRAGS * 32Kbytes, or ~0.5 MB.
> + * This also speeds up tso_fragment(), since it won't fallback
> + * to tcp_fragment().
> + */
> +static int linear_payload_sz(bool first_skb)
> +{
> + if (first_skb)
> + return SKB_WITH_OVERHEAD(2048 - MAX_TCP_HEADER);
> + return 0;
> +}
> +
> +static int select_size(bool first_skb, bool zc)
> +{
> + if (zc)
> + return 0;
> + return linear_payload_sz(first_skb);
> +}
> +
> void tcp_free_fastopen_req(struct tcp_sock *tp)
> {
> if (tp->fastopen_req) {
> @@ -1311,6 +1335,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
>
> if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
> bool first_skb;
> + int linear;
>
> new_segment:
> if (!sk_stream_memory_free(sk))
> @@ -1322,7 +1347,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> goto restart;
> }
> first_skb = tcp_rtx_and_write_queues_empty(sk);
> - skb = tcp_stream_alloc_skb(sk, 0, sk->sk_allocation,
> + linear = select_size(first_skb, zc);
> + skb = tcp_stream_alloc_skb(sk, linear,
> + sk->sk_allocation,
> first_skb);
> if (!skb)
> goto wait_for_space;
> @@ -1344,7 +1371,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
> if (copy > msg_data_left(msg))
> copy = msg_data_left(msg);
>
> - if (!zc) {
> + /* Where to copy to? */
> + if (skb_availroom(skb) > 0 && !zc) {
> + /* We have some space in skb head. Superb! */
> + copy = min_t(int, copy, skb_availroom(skb));
> + err = skb_add_data_nocache(sk, skb, &msg->msg_iter,
> + copy);
> + if (err)
> + goto do_error;
> + } else if (!zc) {
> bool merge = true;
> int i = skb_shinfo(skb)->nr_frags;
> struct page_frag *pfrag = sk_page_frag(sk);
> --
> 2.24.3 (Apple Git-128)
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC net] tcp: Fix performance regression for request-response workloads
2022-09-07 16:06 ` Eric Dumazet
@ 2022-09-08 9:40 ` Christian Borntraeger
2022-09-08 12:41 ` Eric Dumazet
0 siblings, 1 reply; 8+ messages in thread
From: Christian Borntraeger @ 2022-09-08 9:40 UTC (permalink / raw)
To: Eric Dumazet, Alexandra Winter
Cc: David Miller, Jakub Kicinski, Yuchung Cheng,
Soheil Hassas Yeganeh, Willem de Bruijn, Paolo Abeni,
Mat Martineau, Saeed Mahameed, Niklas Schnelle, netdev,
linux-s390, Heiko Carstens
On 07.09.22 at 18:06, Eric Dumazet wrote:
> On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
>>
>> Since linear payload was removed even for single small messages,
>> an additional page is required, and we are measuring a performance impact.
>>
>> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
>> explicitly allowed "payload in skb->head for first skb put in the queue,
>> to not impact RPC workloads."
>> 472c2e07eef0 ("tcp: add one skb cache for tx")
>> made that obsolete and removed it.
>> When
>> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
>> reverted it, this piece was not added back.
>>
>> When running uperf with a request-response pattern with 1k payload
>> and 250 parallel connections, we measure a 13% drop in throughput
>> for our PCI-based network interfaces since 472c2e07eef0.
>> (Our IOMMU is sensitive to the number of mapped pages.)
>
>
>
>>
>> Could you please consider allowing linear payload for the first
>> skb in queue again? A patch proposal is appended below.
>
> No.
>
> Please add a workaround in your driver.
>
> You can increase throughput by 20% by premapping a coherent piece of
> memory into which you can copy small skbs (skb->head included).
>
> Something like 256 bytes per slot in the TX ring.
>
FWIW, this regression was with the standard Mellanox driver (nothing s390-specific).
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC net] tcp: Fix performance regression for request-response workloads
2022-09-08 9:40 ` Christian Borntraeger
@ 2022-09-08 12:41 ` Eric Dumazet
2022-09-26 10:06 ` [RFC net] net/mlx5: " Alexandra Winter
0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2022-09-08 12:41 UTC (permalink / raw)
To: Christian Borntraeger
Cc: Alexandra Winter, David Miller, Jakub Kicinski, Yuchung Cheng,
Soheil Hassas Yeganeh, Willem de Bruijn, Paolo Abeni,
Mat Martineau, Saeed Mahameed, Niklas Schnelle, netdev,
linux-s390, Heiko Carstens
On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger
<borntraeger@linux.ibm.com> wrote:
>
> On 07.09.22 at 18:06, Eric Dumazet wrote:
> > On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
> >>
> >> Since linear payload was removed even for single small messages,
> >> an additional page is required, and we are measuring a performance impact.
> >>
> >> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
> >> explicitly allowed "payload in skb->head for first skb put in the queue,
> >> to not impact RPC workloads."
> >> 472c2e07eef0 ("tcp: add one skb cache for tx")
> >> made that obsolete and removed it.
> >> When
> >> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
> >> reverted it, this piece was not added back.
> >>
> >> When running uperf with a request-response pattern with 1k payload
> >> and 250 parallel connections, we measure a 13% drop in throughput
> >> for our PCI-based network interfaces since 472c2e07eef0.
> >> (Our IOMMU is sensitive to the number of mapped pages.)
> >
> >
> >
> >>
> >> Could you please consider allowing linear payload for the first
> >> skb in queue again? A patch proposal is appended below.
> >
> > No.
> >
> > Please add a workaround in your driver.
> >
> > You can increase throughput by 20% by premapping a coherent piece of
> > memory into which you can copy small skbs (skb->head included).
> >
> > Something like 256 bytes per slot in the TX ring.
> >
>
> FWIW, this regression was with the standard Mellanox driver (nothing s390-specific).
I did not claim this was s390-specific.
Only IOMMU mode.
I would rather not add back something which makes the TCP stack slower
(more tests in the fast path) for the majority of us _not_ using IOMMU.
In our own tests, this trick of using linear skbs was only helping
benchmarks, not real workloads.
Many drivers have to map skb->head a second time if it contains TCP payload,
thus adding yet another corner case in their fast path.
- Typical RPC workloads are playing with TCP_NODELAY
- Typical bulk flows never have empty write queues...
Really, I do not want this optimization back; it is not worth it.
Again, a driver knows better whether it is using IOMMU and whether
pathological layouts can be optimized to non-SG ones, and using a
pre-dma-map zone will also benefit pure TCP ACK packets (which do not
have any payload).
Here is the changelog of a patch I did for our GQ NIC (not yet
upstreamed, but will be soon):
...
The problem is coming from gq_tx_clean() calling
dma_unmap_single(q->dev, p->addr, p->len, DMA_TO_DEVICE);
It seems silly to perform possibly expensive IOMMU operations to
send small packets.
(Pure TCP ACKs are 86 bytes long in total in 99% of cases.)
The idea of this patch is to pre-dma-map a memory zone to hold the
headers of the packet (if less than 128/256 bytes long).
Then, if the whole packet can be copied into this 128/256-byte zone,
just copy it entirely.
This permits consuming the small packets right away in ndo_start_xmit(),
while the skb (and associated socket sk_wmem_alloc) is hot, instead of later
at TX completion time.
This makes the cost of ACK packets much smaller, but also that of tiny TCP
packets (say, in synthetic benchmarks).
We enable this behavior only if IOMMU is used/forced on GQ,
although we might use it regardless of whether IOMMU is used.
...
To recap, there is a huge difference when we cross the 42-byte limit
(for a 128-byte zone per TX ring slot):
iroa21:/home/edumazet# ./super_netperf 200 -H iroa23 -t TCP_RR -l 20 -- -r40,40
2648141
iroa21:/home/edumazet# ./super_netperf 200 -H iroa23 -t TCP_RR -l 20 -- -r44,44
970691
We might experiment with a bigger GQ_TX_INLINE_HEADER_SIZE in the future?
...
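A rough sketch of the completion-side saving described above; the gq_*
structures are guessed for illustration (only dma_unmap_single() and
napi_consume_skb() are real kernel APIs):

struct gq_tx_slot {
	struct sk_buff *skb;	/* NULL when the packet was inlined */
	dma_addr_t addr;
	u32 len;
};

static void gq_tx_clean_one(struct device *dev, struct gq_tx_slot *p)
{
	/* Inlined packets were copied into the premapped zone and their
	 * skb was already consumed in ndo_start_xmit(): nothing to unmap,
	 * nothing to free, no IOMMU work at completion time.
	 */
	if (!p->skb)
		return;

	dma_unmap_single(dev, p->addr, p->len, DMA_TO_DEVICE);
	napi_consume_skb(p->skb, 1);
	p->skb = NULL;
}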
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC net] net/mlx5: Fix performance regression for request-response workloads
2022-09-08 12:41 ` Eric Dumazet
@ 2022-09-26 10:06 ` Alexandra Winter
2022-09-30 23:37 ` Saeed Mahameed
0 siblings, 1 reply; 8+ messages in thread
From: Alexandra Winter @ 2022-09-26 10:06 UTC (permalink / raw)
To: Saeed Mahameed
Cc: David Miller, Jakub Kicinski, Niklas Schnelle, netdev, linux-s390,
Heiko Carstens, Christian Borntraeger, Eric Dumazet
On 08.09.22 14:41, Eric Dumazet wrote:
> On Thu, Sep 8, 2022 at 2:40 AM Christian Borntraeger
> <borntraeger@linux.ibm.com> wrote:
>>
>> On 07.09.22 at 18:06, Eric Dumazet wrote:
>>> On Wed, Sep 7, 2022 at 5:26 AM Alexandra Winter <wintera@linux.ibm.com> wrote:
>>>>
>>>> Since linear payload was removed even for single small messages,
>>>> an additional page is required, and we are measuring a performance impact.
>>>>
>>>> 3613b3dbd1ad ("tcp: prepare skbs for better sack shifting")
>>>> explicitly allowed "payload in skb->head for first skb put in the queue,
>>>> to not impact RPC workloads."
>>>> 472c2e07eef0 ("tcp: add one skb cache for tx")
>>>> made that obsolete and removed it.
>>>> When
>>>> d8b81175e412 ("tcp: remove sk_{tr}x_skb_cache")
>>>> reverted it, this piece was not added back.
>>>>
>>>> When running uperf with a request-response pattern with 1k payload
>>>> and 250 parallel connections, we measure a 13% drop in throughput
>>>> for our PCI-based network interfaces since 472c2e07eef0.
>>>> (Our IOMMU is sensitive to the number of mapped pages.)
>>>
>>>
>>>
>>>>
>>>> Could you please consider allowing linear payload for the first
>>>> skb in queue again? A patch proposal is appended below.
>>>
>>> No.
>>>
>>> Please add a workaround in your driver.
>>>
>>> You can increase throughput by 20% by premapping a coherent piece of
>>> memory into which you can copy small skbs (skb->head included).
>>>
>>> Something like 256 bytes per slot in the TX ring.
>>>
>>
>> FWIW, this regression was with the standard Mellanox driver (nothing s390-specific).
>
> I did not claim this was s390-specific.
>
> Only IOMMU mode.
>
> I would rather not add back something which makes the TCP stack slower
> (more tests in the fast path) for the majority of us _not_ using IOMMU.
>
> In our own tests, this trick of using linear skbs was only helping
> benchmarks, not real workloads.
>
> Many drivers have to map skb->head a second time if it contains TCP payload,
> thus adding yet another corner case in their fast path.
>
> - Typical RPC workloads are playing with TCP_NODELAY
> - Typical bulk flows never have empty write queues...
>
> Really, I do not want this optimization back; it is not worth it.
>
> Again, a driver knows better whether it is using IOMMU and whether
> pathological layouts can be optimized to non-SG ones, and using a
> pre-dma-map zone will also benefit pure TCP ACK packets (which do not
> have any payload).
>
> Here is the changelog of a patch I did for our GQ NIC (not yet
> upstreamed, but will be soon):
>
[...]
Saeed,
As discussed at LPC, could you please consider adding a workaround to the
Mellanox driver, to use non-SG SKBs for small messages? As mentioned above,
we are seeing a 13% throughput degradation if two pages need to be mapped
instead of one.
While Eric's ideas sound very promising, just using non-SG in these cases
should be enough to mitigate the performance regression we see.
Thank you in advance.
Alexandra
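One possible shape of that mitigation, sketched as a generic helper; the
threshold and the hook point in the mlx5e xmit path are assumptions, while
skb_linearize() is the real kernel API that pulls all page frags back into
skb->head, so only one page needs an IOMMU mapping:

#include <linux/skbuff.h>

#define SMALL_PKT_LINEARIZE_THRESH 1024	/* assumed cut-off, to be tuned */

static int maybe_linearize_small_skb(struct sk_buff *skb)
{
	/* Only bother for small frames that actually carry page frags.
	 * skb_linearize() can fail under memory pressure; the caller
	 * then simply maps the frags as before.
	 */
	if (skb->len <= SMALL_PKT_LINEARIZE_THRESH &&
	    skb_shinfo(skb)->nr_frags)
		return skb_linearize(skb);
	return 0;
}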
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC net] net/mlx5: Fix performance regression for request-response workloads
2022-09-26 10:06 ` [RFC net] net/mlx5: " Alexandra Winter
@ 2022-09-30 23:37 ` Saeed Mahameed
2022-12-29 8:27 ` Alexandra Winter
0 siblings, 1 reply; 8+ messages in thread
From: Saeed Mahameed @ 2022-09-30 23:37 UTC (permalink / raw)
To: Alexandra Winter
Cc: David Miller, Jakub Kicinski, Niklas Schnelle, netdev, linux-s390,
Heiko Carstens, Christian Borntraeger, Eric Dumazet
On 26 Sep 12:06, Alexandra Winter wrote:
>
[ ... ]
>[...]
>
>Saeed,
>As discussed at LPC, could you please consider adding a workaround to the
>Mellanox driver, to use non-SG SKBs for small messages? As mentioned above,
>we are seeing a 13% throughput degradation if two pages need to be mapped
>instead of one.
>
>While Eric's ideas sound very promising, just using non-SG in these cases
>should be enough to mitigate the performance regression we see.
Hi Alexandra, sorry for the late response.
Yes, linearizing small messages makes sense, but it will require some careful
perf testing.
We will do our best to include this in the next kernel release cycle.
I will take it up with the mlx5e team next week; everybody is on vacation this
time of year :).
Thanks,
Saeed.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC net] net/mlx5: Fix performance regression for request-response workloads
2022-09-30 23:37 ` Saeed Mahameed
@ 2022-12-29 8:27 ` Alexandra Winter
2023-04-27 9:44 ` Alexandra Winter
0 siblings, 1 reply; 8+ messages in thread
From: Alexandra Winter @ 2022-12-29 8:27 UTC (permalink / raw)
To: Saeed Mahameed
Cc: David Miller, Jakub Kicinski, Niklas Schnelle, netdev, linux-s390,
Heiko Carstens, Christian Borntraeger, Eric Dumazet
On 01.10.22 01:37, Saeed Mahameed wrote:
> On 26 Sep 12:06, Alexandra Winter wrote:
>>
>
> [ ... ]
>
>> [...]
>>
>> Saeed,
>> As discussed at LPC, could you please consider adding a workaround to the
>> Mellanox driver, to use non-SG SKBs for small messages? As mentioned above,
>> we are seeing a 13% throughput degradation if two pages need to be mapped
>> instead of one.
>>
>> While Eric's ideas sound very promising, just using non-SG in these cases
>> should be enough to mitigate the performance regression we see.
>
> Hi Alexandra, sorry for the late response.
>
> Yes, linearizing small messages makes sense, but it will require some careful
> perf testing.
>
> We will do our best to include this in the next kernel release cycle.
> I will take it up with the mlx5e team next week; everybody is on vacation this
> time of year :).
>
> Thanks,
> Saeed.
Hello Saeed,
may I ask whether you had a chance to include such a patch in the 6.2 kernel?
Or is this still on your to-do list?
I haven't seen anything like this on the mailing list, but I may have overlooked it.
All the best for 2023
Alexandra
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC net] net/mlx5: Fix performance regression for request-response workloads
2022-12-29 8:27 ` Alexandra Winter
@ 2023-04-27 9:44 ` Alexandra Winter
0 siblings, 0 replies; 8+ messages in thread
From: Alexandra Winter @ 2023-04-27 9:44 UTC (permalink / raw)
To: Saeed Mahameed
Cc: David Miller, Jakub Kicinski, Niklas Schnelle, netdev, linux-s390,
Heiko Carstens, Christian Borntraeger, Eric Dumazet, Halil Pasic
On 29.12.22 09:27, Alexandra Winter wrote:
>
>
> On 01.10.22 01:37, Saeed Mahameed wrote:
>> On 26 Sep 12:06, Alexandra Winter wrote:
>>
>> [ ... ]
>>>
>>> Saeed,
>>> As discussed at LPC, could you please consider adding a workaround to the
>>> Mellanox driver, to use non-SG SKBs for small messages? As mentioned above,
>>> we are seeing a 13% throughput degradation if two pages need to be mapped
>>> instead of one.
>>>
>>> While Eric's ideas sound very promising, just using non-SG in these cases
>>> should be enough to mitigate the performance regression we see.
>>
>> Hi Alexandra, sorry for the late response.
>>
>> Yes, linearizing small messages makes sense, but it will require some careful
>> perf testing.
>>
>> We will do our best to include this in the next kernel release cycle.
>> I will take it up with the mlx5e team next week; everybody is on vacation this
>> time of year :).
>>
>> Thanks,
>> Saeed.
>
> Hello Saeed,
> may I ask whether you had a chance to include such a patch in the 6.2 kernel?
> Or is this still on your ToDo list?
> I haven't seen anything like this on the mailing list, but I may have overlooked it.
> All the best for 2023
> Alexandra
Hello Saeed,
any news about linearizing small messages? Is there any way we could be of help?
Kind regards
Alexandra
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2023-04-27 9:44 UTC | newest]
Thread overview: 8+ messages
2022-09-07 12:25 [RFC net] tcp: Fix performance regression for request-response workloads Alexandra Winter
2022-09-07 16:06 ` Eric Dumazet
2022-09-08 9:40 ` Christian Borntraeger
2022-09-08 12:41 ` Eric Dumazet
2022-09-26 10:06 ` [RFC net] net/mlx5: " Alexandra Winter
2022-09-30 23:37 ` Saeed Mahameed
2022-12-29 8:27 ` Alexandra Winter
2023-04-27 9:44 ` Alexandra Winter