public inbox for linux-s390@vger.kernel.org
From: Dragos Tatulea <dtatulea@nvidia.com>
To: Alexandra Winter <wintera@linux.ibm.com>,
	Alexander Lobakin <aleksander.lobakin@intel.com>
Cc: Rahul Rameshbabu <rrameshbabu@nvidia.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Tariq Toukan <tariqt@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>,
	David Miller <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Eric Dumazet <edumazet@google.com>,
	Andrew Lunn <andrew+netdev@lunn.ch>,
	Nils Hoppmann <niho@linux.ibm.com>,
	netdev@vger.kernel.org, linux-s390@vger.kernel.org,
	Heiko Carstens <hca@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Thorsten Winkler <twinkler@linux.ibm.com>,
	Simon Horman <horms@kernel.org>,
	Niklas Schnelle <schnelle@linux.ibm.com>
Subject: Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
Date: Wed, 11 Dec 2024 18:28:19 +0100	[thread overview]
Message-ID: <89c67aa3-4b84-45c1-9e7f-a608957d5aeb@nvidia.com> (raw)
In-Reply-To: <54738182-3438-4a58-8c33-27dc20a4b3fe@linux.ibm.com>


Hi Alexandra,

On 11.12.24 14:35, Alexandra Winter wrote:
> 
> 
> On 10.12.24 14:54, Alexander Lobakin wrote:
>> From: Dragos Tatulea <dtatulea@nvidia.com>
>> Date: Tue, 10 Dec 2024 12:44:04 +0100
>>
>>>
>>>
>>> On 06.12.24 16:20, Alexandra Winter wrote:
>>>>
>>>>
>>>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>>>  {
>>>>>>  	struct mlx5e_sq_stats *stats = sq->stats;
>>>>>>  
>>>>>> +	/* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>>>> +	if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>>>>> +		skb_linearize(skb);
>>>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>>>
>>>>
>>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>>>> Any opinions from the NVidia people?
>>>>
>>> Agreed.
>>>
>>>>
>>>>> 2. Why truesize, not something like
>>>>>
>>>>> 	if (skb->len <= some_sane_value_maybe_1k)
>>>>
>>>>
>>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into one page.
>>>> If we set the threshold to a smaller value, skb->len makes more sense.
>>>>
>>>>
>>>>>
>>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 Kb, I don't think
>>>>>    it's a good idea to rely on this.
>>>>>    Some test-based hardcode would be enough (i.e. threshold on which
>>>>>    DMA mapping starts performing better).
>>>>
>>>>
>>>> A threshold of 4k is absolutely fine with us (s390).
>>>> A threshold of 1k would definitely improve our situation and bring back the performance for some important scenarios.
>>>>
>>>>
>>>> NVidia people, do you have any opinion on a good threshold?
>>>>
> 
> On 09.12.24 12:36, Tariq Toukan wrote:
>> Hi,
>>
>> Many approaches in the past few years are going the opposite direction, trying to avoid copies ("zero-copy").
>>
>> In many cases, copy up to PAGE_SIZE means copy everything.
>> For high NIC speeds this is not realistic.
>>
>> Anyway, based on past experience, threshold should not exceed "max header size" (128/256b).
> 
>>> 1KB is still too large. As Tariq mentioned, the threshold should not
>>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>>> regressions but I need to play with it more.
I checked on an x86 system with CX7 and we seem to get ~4% degradation
when using this approach. The patch was modified a bit according to
previous discussions (diff at end of mail).

Here's how I tested:
- uperf client side has many queues.
- uperf server side has single queue with interrupts pinned to a single
  CPU. This is to better isolate CPU behaviour. The idea is to have the
  CPU on the server side saturated or close to saturation.
- Used the uperf 1B read x 1B write scenario with 50 and 100 threads
  (profile attached).
  Both the on-cpu and off-cpu cases were checked.
- Code change was done only on server side.

The results:
```
Case:                                          Throughput   Operations
----------------------------------------------------------------------
app cpu == irq cpu, nthreads= 25, baseline     3.86Mb/s     30036552  
app cpu == irq cpu, nthreads= 25, skb_linear   3.52Mb/s     26969315  

app cpu == irq cpu, nthreads= 50, baseline     4.26Mb/s     33122458 
app cpu == irq cpu, nthreads= 50, skb_linear   4.06Mb/s     31606290 

app cpu == irq cpu, nthreads=100, baseline     4.08Mb/s     31775434  
app cpu == irq cpu, nthreads=100, skb_linear   3.86Mb/s     30105582

app cpu != irq cpu, nthreads= 25, baseline     3.57Mb/s     27785488
app cpu != irq cpu, nthreads= 25, skb_linear   3.56Mb/s     27730133

app cpu != irq cpu, nthreads= 50, baseline     3.97Mb/s     30876264
app cpu != irq cpu, nthreads= 50, skb_linear   3.82Mb/s     29737654

app cpu != irq cpu, nthreads=100, baseline     4.06Mb/s     31621140
app cpu != irq cpu, nthreads=100, skb_linear   3.90Mb/s     30364382
```

So not encouraging. I restricted the tests to 1B payloads only as I
expected worse perf for larger payloads.

>>
>> On different setups, usually the copybreak of 192 or 256 bytes was the
>> most efficient as well.
>>
>>>
> 
> 
> Thank you very much, everybody for looking into this.
> 
> Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390
> and with this patch we get up to 12% more transactions/sec even for 1k messages.
> 
> As we're always using an IOMMU and are operating with large multi socket systems,
> DMA costs far outweigh the costs of small to medium memcpy()s on our system.
> We realize that the recommendation is to just run without IOMMU when performance is important,
> but this is not an option in our environment.
> 
> I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a
> strategic direction, it is more a simple mitigation of our problem.
> 
> Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements.
> We could even restrict it to S390 arch, if you want.
>
Maybe Tariq can comment on this.

> A threshold of 256b would cover some amount of our typical request-response workloads
> (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k).
> Could we maybe define an architecture dependent threshold?
> 
> My preferred scenario for the next steps would be the following:
> 1) It would be great if we could get a simple mitigation patch upstream, that the distros could
> easily backport. This would avoid our customers experiencing performance regression when they
> upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example)
>
Stupid question on my part: why can't this patch be taken as a distro
patch for s390 and carried across releases? This way the kernel
upgrade pain would be avoided.

> 2) Then we could work on a more invasive solution. (see proposals by Eric, Dragos/Saeed).
> This would then replace the simple mitigation patch upstream, and future releases would contain 
> that solution. If somebody else wants to work on this improved version, fine with me, otherwise
> I could give it a try.
> 
For the inline WQE solution I don't think we have a large amount of space
to copy so much data into. Eric's side-buffer suggestion would be even
more invasive and would touch many more code paths...

> What do you think of this approach?
> 
>
Sounds tricky. Let's see what Tariq has to say.

Thanks,
Dragos

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..cc947daa538b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -269,6 +269,9 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
 {
        struct mlx5e_sq_stats *stats = sq->stats;
 
+       if (skb->len < 256)
+               skb_linearize(skb);
+
        if (skb_is_gso(skb)) {
                int hopbyhop;
                u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);

[-- Attachment #2: rr1c-1x1.xml --]
[-- Type: text/xml, Size: 522 bytes --]

<?xml version="1.0"?>
<profile name="netperf">
  <group nthreads="$nthreads">
        <transaction iterations="$iter">
            <flowop type="accept" options="remotehost=$h1 protocol=$proto tcp_nodelay"/>
        </transaction>
        <transaction duration="$dur">
            <flowop type="read" options="size=1"/>
            <flowop type="write" options="size=1"/>
        </transaction>
        <transaction iterations="$iter">
            <flowop type="disconnect" />
        </transaction>
  </group>
</profile>

Thread overview: 21+ messages
2024-12-04 14:02 [PATCH net-next] net/mlx5e: Transmit small messages in linear skb Alexandra Winter
2024-12-04 14:16 ` Eric Dumazet
2024-12-04 14:35   ` Alexandra Winter
2024-12-04 14:36   ` Eric Dumazet
2024-12-06 14:47     ` David Laight
2024-12-06 16:35       ` Eric Dumazet
2024-12-06 15:25     ` Alexandra Winter
2024-12-10 11:49       ` Dragos Tatulea
2024-12-11 16:19         ` Alexandra Winter
2024-12-11 17:36           ` Dragos Tatulea
2024-12-04 14:32 ` Alexander Lobakin
2024-12-06 15:20   ` Alexandra Winter
2024-12-09 11:36     ` Tariq Toukan
2024-12-10 11:44     ` Dragos Tatulea
2024-12-10 13:54       ` Alexander Lobakin
2024-12-10 17:10         ` Joe Damato
2024-12-11 13:35         ` Alexandra Winter
2024-12-11 17:28           ` Dragos Tatulea [this message]
2024-12-11 17:50             ` Niklas Schnelle
2024-12-13 20:41               ` Dragos Tatulea
2024-12-12 10:36             ` Christian Borntraeger
