From: Dragos Tatulea <dtatulea@nvidia.com>
To: Alexandra Winter <wintera@linux.ibm.com>,
Alexander Lobakin <aleksander.lobakin@intel.com>
Cc: Rahul Rameshbabu <rrameshbabu@nvidia.com>,
Saeed Mahameed <saeedm@nvidia.com>,
Tariq Toukan <tariqt@nvidia.com>,
Leon Romanovsky <leon@kernel.org>,
David Miller <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Eric Dumazet <edumazet@google.com>,
Andrew Lunn <andrew+netdev@lunn.ch>,
Nils Hoppmann <niho@linux.ibm.com>,
netdev@vger.kernel.org, linux-s390@vger.kernel.org,
Heiko Carstens <hca@linux.ibm.com>,
Vasily Gorbik <gor@linux.ibm.com>,
Alexander Gordeev <agordeev@linux.ibm.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Sven Schnelle <svens@linux.ibm.com>,
Thorsten Winkler <twinkler@linux.ibm.com>,
Simon Horman <horms@kernel.org>,
Niklas Schnelle <schnelle@linux.ibm.com>
Subject: Re: [PATCH net-next] net/mlx5e: Transmit small messages in linear skb
Date: Wed, 11 Dec 2024 18:28:19 +0100 [thread overview]
Message-ID: <89c67aa3-4b84-45c1-9e7f-a608957d5aeb@nvidia.com> (raw)
In-Reply-To: <54738182-3438-4a58-8c33-27dc20a4b3fe@linux.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 7060 bytes --]
Hi Alexandra,
On 11.12.24 14:35, Alexandra Winter wrote:
>
>
> On 10.12.24 14:54, Alexander Lobakin wrote:
>> From: Dragos Tatulea <dtatulea@nvidia.com>
>> Date: Tue, 10 Dec 2024 12:44:04 +0100
>>
>>>
>>>
>>> On 06.12.24 16:20, Alexandra Winter wrote:
>>>>
>>>>
>>>> On 04.12.24 15:32, Alexander Lobakin wrote:
>>>>>> @@ -269,6 +270,10 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>>>>>> {
>>>>>> struct mlx5e_sq_stats *stats = sq->stats;
>>>>>>
>>>>>> + /* Don't require 2 IOMMU TLB entries, if one is sufficient */
>>>>>> + if (use_dma_iommu(sq->pdev) && skb->truesize <= PAGE_SIZE)
>>>>>> +	skb_linearize(skb);
>>>>> 1. What's with the direct DMA? I believe it would benefit, too?
>>>>
>>>>
>>>> Removing the use_dma_iommu check is fine with us (s390). It is just a proposal to reduce the impact.
>>>> Any opinions from the NVidia people?
>>>>
>>> Agreed.
>>>
>>>>
>>>>> 2. Why truesize, not something like
>>>>>
>>>>> if (skb->len <= some_sane_value_maybe_1k)
>>>>
>>>>
>>>> With (skb->truesize <= PAGE_SIZE) the whole "head" buffer fits into 1 page.
>>>> When we set the threshold at a smaller value, skb->len makes more sense.
>>>>
>>>>
>>>>>
>>>>> 3. As Eric mentioned, PAGE_SIZE can be up to 256 KB, so I don't think
>>>>> it's a good idea to rely on it.
>>>>> Some test-based hardcode would be enough (i.e. threshold on which
>>>>> DMA mapping starts performing better).
>>>>
>>>>
>>>> A threshold of 4k is absolutely fine with us (s390).
>>>> A threshold of 1k would definitively improve our situation and bring back the performance for some important scenarios.
>>>>
>>>>
>>>> NVidia people, do you have any opinion on a good threshold?
>>>>
>
> On 09.12.24 12:36, Tariq Toukan wrote:
>> Hi,
>>
>> Many approaches in the past few years have gone in the opposite direction, trying to avoid copies ("zero-copy").
>>
>> In many cases, copy up to PAGE_SIZE means copy everything.
>> For high NIC speeds this is not realistic.
>>
>> Anyway, based on past experience, the threshold should not exceed the "max header size" (128/256B).
>
>>> 1KB is still too large. As Tariq mentioned, the threshold should not
>>> exceed 128/256B. I am currently testing this with 256B on x86. So far no
>>> regressions, but I need to play with it more.
I checked on an x86 system with CX7 and we seem to get ~4% degradation
when using this approach. The patch was modified a bit according to
previous discussions (diff at end of mail).
Here's how I tested:
- uperf client side has many queues.
- uperf server side has single queue with interrupts pinned to a single
CPU. This is to better isolate CPU behaviour. The idea is to have the
CPU on the server side saturated or close to saturation.
- Used the uperf 1B read x 1B write scenario with 25, 50 and 100 threads
(profile attached).
Both the on-cpu and off-cpu cases were checked.
- Code change was done only on server side.
The results:
```
Case                                           Throughput  Operations
----------------------------------------------------------------------
app cpu == irq cpu, nthreads= 25, baseline     3.86Mb/s    30036552
app cpu == irq cpu, nthreads= 25, skb_linear   3.52Mb/s    26969315
app cpu == irq cpu, nthreads= 50, baseline     4.26Mb/s    33122458
app cpu == irq cpu, nthreads= 50, skb_linear   4.06Mb/s    31606290
app cpu == irq cpu, nthreads=100, baseline     4.08Mb/s    31775434
app cpu == irq cpu, nthreads=100, skb_linear   3.86Mb/s    30105582
app cpu != irq cpu, nthreads= 25, baseline     3.57Mb/s    27785488
app cpu != irq cpu, nthreads= 25, skb_linear   3.56Mb/s    27730133
app cpu != irq cpu, nthreads= 50, baseline     3.97Mb/s    30876264
app cpu != irq cpu, nthreads= 50, skb_linear   3.82Mb/s    29737654
app cpu != irq cpu, nthreads=100, baseline     4.06Mb/s    31621140
app cpu != irq cpu, nthreads=100, skb_linear   3.90Mb/s    30364382
```
So not encouraging. I restricted the tests to 1B payloads only as I
expected worse perf for larger payloads.
>>
>> On different setups, usually the copybreak of 192 or 256 bytes was the
>> most efficient as well.
>>
>>>
>
>
> Thank you very much, everybody for looking into this.
>
> Unfortunately we are seeing these performance regressions with ConnectX-6 cards on s390
> and with this patch we get up to 12% more transactions/sec even for 1k messages.
>
> As we're always using an IOMMU and are operating with large multi socket systems,
> DMA costs far outweigh the costs of small to medium memcpy()s on our system.
> We realize that the recommendation is to just run without IOMMU when performance is important,
> but this is not an option in our environment.
>
> I understand that the simple patch of calling skb_linearize() in mlx5 for small messages is not a
> strategic direction; it is more a simple mitigation of our problem.
>
> Whether it should be restricted to use_dma_iommu() or not is up to you and your measurements.
> We could even restrict it to S390 arch, if you want.
>
Maybe Tariq can comment on this.
> A threshold of 256b would cover some amount of our typical request-response workloads
> (think database queries/updates), but we would prefer a higher number (e.g. 1k or 2k).
> Could we maybe define an architecture dependent threshold?
>
> My preferred scenario for the next steps would be the following:
> 1) It would be great if we could get a simple mitigation patch upstream, that the distros could
> easily backport. This would avoid our customers experiencing performance regression when they
> upgrade their distro versions. (e.g. from RHEL8 to RHEL9 or RHEL10 just as an example)
>
Stupid question on my part: why can't this patch be taken as a distro
patch for s390 and carried over across releases? This way the kernel
upgrade pain would be avoided.
> 2) Then we could work on a more invasive solution. (see proposals by Eric, Dragos/Saeed).
> This would then replace the simple mitigation patch upstream, and future releases would contain
> that solution. If somebody else wants to work on this improved version, fine with me, otherwise
> I could give it a try.
>
For the inline WQE solution I don't think we have a large amount of space
to copy so much data into. Eric's side-buffer suggestion would be even
more invasive and would touch many more code paths...
> What do you think of this approach?
>
>
Sounds tricky. Let's see what Tariq has to say.
Thanks,
Dragos
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index f8c7912abe0e..cc947daa538b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -269,6 +269,9 @@ static void mlx5e_sq_xmit_prepare(struct mlx5e_txqsq *sq, struct sk_buff *skb,
{
struct mlx5e_sq_stats *stats = sq->stats;
+ if (skb->len < 256)
+ skb_linearize(skb);
+
if (skb_is_gso(skb)) {
int hopbyhop;
u16 ihs = mlx5e_tx_get_gso_ihs(sq, skb, &hopbyhop);
[-- Attachment #2: rr1c-1x1.xml --]
[-- Type: text/xml, Size: 522 bytes --]
<?xml version="1.0"?>
<profile name="netperf">
<group nthreads="$nthreads">
<transaction iterations="$iter">
<flowop type="accept" options="remotehost=$h1 protocol=$proto tcp_nodelay"/>
</transaction>
<transaction duration="$dur">
<flowop type="read" options="size=1"/>
<flowop type="write" options="size=1"/>
</transaction>
<transaction iterations="$iter">
<flowop type="disconnect" />
</transaction>
</group>
</profile>
Thread overview: 21+ messages
2024-12-04 14:02 [PATCH net-next] net/mlx5e: Transmit small messages in linear skb Alexandra Winter
2024-12-04 14:16 ` Eric Dumazet
2024-12-04 14:35 ` Alexandra Winter
2024-12-04 14:36 ` Eric Dumazet
2024-12-06 14:47 ` David Laight
2024-12-06 16:35 ` Eric Dumazet
2024-12-06 15:25 ` Alexandra Winter
2024-12-10 11:49 ` Dragos Tatulea
2024-12-11 16:19 ` Alexandra Winter
2024-12-11 17:36 ` Dragos Tatulea
2024-12-04 14:32 ` Alexander Lobakin
2024-12-06 15:20 ` Alexandra Winter
2024-12-09 11:36 ` Tariq Toukan
2024-12-10 11:44 ` Dragos Tatulea
2024-12-10 13:54 ` Alexander Lobakin
2024-12-10 17:10 ` Joe Damato
2024-12-11 13:35 ` Alexandra Winter
2024-12-11 17:28 ` Dragos Tatulea [this message]
2024-12-11 17:50 ` Niklas Schnelle
2024-12-13 20:41 ` Dragos Tatulea
2024-12-12 10:36 ` Christian Borntraeger