netdev.vger.kernel.org archive mirror
* AMD IOMMU problem after NIC uses multi-page allocation
@ 2023-03-30  1:14 Jakub Kicinski
  2023-03-30  2:36 ` Yunsheng Lin
  2023-03-30  7:41 ` Joerg Roedel
  0 siblings, 2 replies; 7+ messages in thread
From: Jakub Kicinski @ 2023-03-30  1:14 UTC (permalink / raw)
  To: Joerg Roedel, Suravee Suthikulpanit
  Cc: iommu, netdev@vger.kernel.org, Willem de Bruijn, Saeed Mahameed

Hi Joerg, Suravee,

I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).

The NIC allocates a buffer for Rx packets which is the MTU rounded up
to page size. If I run it with a 1500B MTU or a 9000B MTU everything is
fine, with a slight but manageable perf hit.

But if I flip the MTU to 9k, run some traffic and then go back to 1.5k,
70%+ of CPU cycles are spent in alloc_iova (and children).
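
For concreteness, the sizing works out to roughly the sketch below
(illustrative only; rx_buf_size() and the exact overhead terms are
assumptions, not taken from any particular driver):

#include <linux/if_ether.h>	/* ETH_HLEN, ETH_FCS_LEN */
#include <linux/mm.h>		/* PAGE_ALIGN, PAGE_SIZE */

/*
 * MTU plus L2 overhead, rounded up to a whole number of pages: 1500B
 * fits in a single 4K page, while 9000B needs a multi-page buffer, and
 * each Rx buffer becomes one DMA mapping (and hence one IOVA
 * allocation) of that size.
 */
static inline unsigned int rx_buf_size(unsigned int mtu)
{
	return PAGE_ALIGN(mtu + ETH_HLEN + ETH_FCS_LEN);
}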

Does this ring any bells?


* Re: AMD IOMMU problem after NIC uses multi-page allocation
  2023-03-30  1:14 AMD IOMMU problem after NIC uses multi-page allocation Jakub Kicinski
@ 2023-03-30  2:36 ` Yunsheng Lin
  2023-03-30  7:41 ` Joerg Roedel
  1 sibling, 0 replies; 7+ messages in thread
From: Yunsheng Lin @ 2023-03-30  2:36 UTC (permalink / raw)
  To: Jakub Kicinski, Joerg Roedel, Suravee Suthikulpanit
  Cc: iommu, netdev@vger.kernel.org, Willem de Bruijn, Saeed Mahameed

On 2023/3/30 9:14, Jakub Kicinski wrote:
> Hi Joerg, Suravee,
> 
> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
> 
> The NIC allocates a buffer for Rx packets which is the MTU rounded up
> to page size. If I run it with a 1500B MTU or a 9000B MTU everything is
> fine, with a slight but manageable perf hit.
> 
> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k,
> 70%+ of CPU cycles are spent in alloc_iova (and children).
> 
> Does this ring any bells?

My bell points to the info below; not sure if it will help:
https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leizhen@huawei.com/



* Re: AMD IOMMU problem after NIC uses multi-page allocation
  2023-03-30  1:14 AMD IOMMU problem after NIC uses multi-page allocation Jakub Kicinski
  2023-03-30  2:36 ` Yunsheng Lin
@ 2023-03-30  7:41 ` Joerg Roedel
  2023-03-30 12:07   ` Vasant Hegde
  2023-03-30 13:04   ` Robin Murphy
  1 sibling, 2 replies; 7+ messages in thread
From: Joerg Roedel @ 2023-03-30  7:41 UTC (permalink / raw)
  To: Jakub Kicinski, Vasant Hegde, Robin Murphy
  Cc: Suravee Suthikulpanit, iommu, netdev@vger.kernel.org,
	Willem de Bruijn, Saeed Mahameed

Also adding Vasant and Robin.

Vasant, Robin, any idea?

On Wed, Mar 29, 2023 at 06:14:07PM -0700, Jakub Kicinski wrote:
> Hi Joerg, Suravee,
> 
> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
> 
> The NIC allocates a buffer for Rx packets which is the MTU rounded up
> to page size. If I run it with a 1500B MTU or a 9000B MTU everything is
> fine, with a slight but manageable perf hit.
> 
> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k,
> 70%+ of CPU cycles are spent in alloc_iova (and children).
> 
> Does this ring any bells?


* Re: AMD IOMMU problem after NIC uses multi-page allocation
  2023-03-30  7:41 ` Joerg Roedel
@ 2023-03-30 12:07   ` Vasant Hegde
  2023-03-30 13:04   ` Robin Murphy
  1 sibling, 0 replies; 7+ messages in thread
From: Vasant Hegde @ 2023-03-30 12:07 UTC (permalink / raw)
  To: Joerg Roedel, Jakub Kicinski, Robin Murphy
  Cc: Suravee Suthikulpanit, iommu, netdev@vger.kernel.org,
	Willem de Bruijn, Saeed Mahameed

Hi Jakub,

On 3/30/2023 1:11 PM, Joerg Roedel wrote:
> Also adding Vasant and Robin.
> 
> Vasant, Robin, any idea?
> 

I tried a few things on my Milan system, but I can't reproduce the issue.

Can you please try the patch mentioned by Yunsheng?

If it's still an issue, can you please provide steps to reproduce? I will take a
look.

-Vasant


> On Wed, Mar 29, 2023 at 06:14:07PM -0700, Jakub Kicinski wrote:
>> Hi Joerg, Suravee,
>>
>> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
>>
>> The NIC allocates a buffer for Rx packets which is the MTU rounded up
>> to page size. If I run it with a 1500B MTU or a 9000B MTU everything is
>> fine, with a slight but manageable perf hit.
>>
>> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k,
>> 70%+ of CPU cycles are spent in alloc_iova (and children).
>>
>> Does this ring any bells?



* Re: AMD IOMMU problem after NIC uses multi-page allocation
  2023-03-30  7:41 ` Joerg Roedel
  2023-03-30 12:07   ` Vasant Hegde
@ 2023-03-30 13:04   ` Robin Murphy
  2023-03-30 13:10     ` Robin Murphy
  1 sibling, 1 reply; 7+ messages in thread
From: Robin Murphy @ 2023-03-30 13:04 UTC (permalink / raw)
  To: Joerg Roedel, Jakub Kicinski, Vasant Hegde
  Cc: Suravee Suthikulpanit, iommu, netdev@vger.kernel.org,
	Willem de Bruijn, Saeed Mahameed

On 2023-03-30 08:41, Joerg Roedel wrote:
> Also adding Vasant and Robin.
> 
> Vasant, Robin, any idea?
> 
> On Wed, Mar 29, 2023 at 06:14:07PM -0700, Jakub Kicinski wrote:
>> Hi Joerg, Suravee,
>>
>> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
>>
>> The NIC allocates a buffer for Rx packets which is the MTU rounded up
>> to page size. If I run it with a 1500B MTU or a 9000B MTU everything is
>> fine, with a slight but manageable perf hit.
>>
>> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k,
>> 70%+ of CPU cycles are spent in alloc_iova (and children).
>>
>> Does this ring any bells?

There is that old issue already mentioned where there seems to be some 
interplay between the IOVA caching and the lazy flush queue, which we 
never really managed to get to the bottom of. IIRC my hunch was that 
with a sufficiently large number of CPUs, fq_flush_timeout() overwhelms 
the rcache depot and gets into a pathological state where it then 
continually thrashes the IOVA rbtree in a fight with the caching system.
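
For reference, the caching layer in question is a set of per-CPU
magazines in front of a small fixed-size global depot, which in turn
sits in front of the rbtree; roughly (paraphrased from
drivers/iommu/iova.c around v5.19, exact constants differ between
versions):

#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* size classes: 1..32 pages */
#define IOVA_MAG_SIZE 128		/* PFNs per magazine */
#define MAX_GLOBAL_MAGS 32		/* depot magazines per size class */

struct iova_magazine {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE];
};

struct iova_cpu_rcache {
	spinlock_t lock;
	struct iova_magazine *loaded;
	struct iova_magazine *prev;
};

struct iova_rcache {
	spinlock_t lock;
	unsigned long depot_size;
	struct iova_magazine *depot[MAX_GLOBAL_MAGS];
	struct iova_cpu_rcache __percpu *cpu_rcaches;
};

/*
 * On free, PFNs go into the local CPU's magazines; when both are full,
 * a whole magazine is pushed to the depot, and once the depot is full a
 * magazine's worth of IOVAs is released back into the rbtree under the
 * rbtree lock. fq_flush_timeout() frees every CPU's queued IOVAs at
 * once, which is how a large CPU count can overflow the fixed-size
 * depot and end up hammering the rbtree.
 */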

Another (simpler) possibility which comes to mind is if the 9K MTU 
(which I guess means 16KB IOVA allocations) puts you up against the 
threshold of available 32-bit IOVA space - if you keep using the 16K 
entries then you'll mostly be recycling them out of the IOVA caches, 
which is nice and fast. However once you switch back to 1500 so needing 
2KB IOVAs, you've now got a load of IOVA space hogged by all the 16KB 
entries that are now hanging around in caches, which could push you into 
the case where the optimistic 32-bit allocation starts to fail (but 
because it *can* fall back to a 64-bit allocation, it's not going to 
purge those unused 16KB entries to free up more 32-bit space). If the 
32-bit space then *stays* full, alloc_iova should stay in fail-fast 
mode, but if some 2KB allocations were below 32 bits and eventually get 
freed back to the tree, then subsequent attempts are liable to spend 
ages doing their best to scrape up all the available 32-bit space 
until it's definitely full again. For that case, [1] should help.
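
The "optimistic 32-bit, then fall back" behaviour mentioned above looks
roughly like this in the DMA layer (a simplified paraphrase of
iommu_dma_alloc_iova() in drivers/iommu/dma-iommu.c around v5.19, not
verbatim):

/* Try to get PCI devices a SAC (32-bit) address first... */
if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
	iova = alloc_iova_fast(iovad, iova_len,
			       DMA_BIT_MASK(32) >> shift, false);

/*
 * ...and only fall back to the full DMA mask if that fails. Note the
 * last argument (flush_rcache): the 32-bit attempt passes false, so a
 * failure here does not purge cached-but-unused entries (such as idle
 * 16KB ones) to reclaim 32-bit space; only the fallback attempt is
 * allowed to flush the caches and retry.
 */
if (!iova)
	iova = alloc_iova_fast(iovad, iova_len,
			       dma_limit >> shift, true);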

Even in the second case, though, I think hitting the rbtree much at all 
still implies that the caches might not be well-matched to the 
workload's map/unmap pattern, and maybe scaling up the depot size could 
still be the biggest win.

Thanks,
Robin.

[1] 
https://lore.kernel.org/linux-iommu/e9abc601b00e26fd15a583fcd55f2a8227903077.1674061620.git.robin.murphy@arm.com/


* Re: AMD IOMMU problem after NIC uses multi-page allocation
  2023-03-30 13:04   ` Robin Murphy
@ 2023-03-30 13:10     ` Robin Murphy
  2023-03-31  4:06       ` Jakub Kicinski
  0 siblings, 1 reply; 7+ messages in thread
From: Robin Murphy @ 2023-03-30 13:10 UTC (permalink / raw)
  To: Joerg Roedel, Jakub Kicinski, Vasant Hegde
  Cc: Suravee Suthikulpanit, iommu, netdev@vger.kernel.org,
	Willem de Bruijn, Saeed Mahameed

On 2023-03-30 14:04, Robin Murphy wrote:
> On 2023-03-30 08:41, Joerg Roedel wrote:
>> Also adding Vasant and Robin.
>>
>> Vasant, Robin, any idea?
>>
>> On Wed, Mar 29, 2023 at 06:14:07PM -0700, Jakub Kicinski wrote:
>>> Hi Joerg, Suravee,
>>>
>>> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
>>>
>>> The NIC allocates a buffer for Rx packets which is the MTU rounded up
>>> to page size. If I run it with a 1500B MTU or a 9000B MTU everything is
>>> fine, with a slight but manageable perf hit.
>>>
>>> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k,
>>> 70%+ of CPU cycles are spent in alloc_iova (and children).
>>>
>>> Does this ring any bells?
> 
> There is that old issue already mentioned where there seems to be some 
> interplay between the IOVA caching and the lazy flush queue, which we 
> never really managed to get to the bottom of. IIRC my hunch was that 
> with a sufficiently large number of CPUs, fq_flush_timeout() overwhelms 
> the rcache depot and gets into a pathological state where it then 
> continually thrashes the IOVA rbtree in a fight with the caching system.
> 
> Another (simpler) possibility which comes to mind is if the 9K MTU 
> (which I guess means 16KB IOVA allocations) puts you up against the 
> threshold of available 32-bit IOVA space - if you keep using the 16K 
> entries then you'll mostly be recycling them out of the IOVA caches, 
> which is nice and fast. However once you switch back to 1500 so needing 
> 2KB IOVAs, you've now got a load of IOVA space hogged by all the 16KB 
> entries that are now hanging around in caches, which could push you into 
> the case where the optimistic 32-bit allocation starts to fail (but 
> because it *can* fall back to a 64-bit allocation, it's not going to 
> purge those unused 16KB entries to free up more 32-bit space). If the 
> 32-bit space then *stays* full, alloc_iova should stay in fail-fast 
> mode, but if some 2KB allocations were below 32 bits and eventually get 
> freed back to the tree, then subsequent attempts are liable to spend 
> ages doing their best to scrape up all the available 32-bit space 
> until it's definitely full again. For that case, [1] should help.

...where by "2KB" I obviously mean 4KB, since apparently in remembering 
that the caches round up to powers of two I managed to forget that 
that's still in units of IOVA pages, derp.
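
The rounding in question happens in alloc_iova_fast(); paraphrased from
around v5.19, anything small enough to be cacheable is rounded up to
the next power of two in units of IOVA pages, which is consistent with
the 4KB and 16KB figures above:

/*
 * Non-power-of-two sizes are awkward for the IOVA caches, so anything
 * cacheable is rounded up to the next power of two (in pages) before
 * allocation.
 */
if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
	size = roundup_pow_of_two(size);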

Robin.

> 
> Even in the second case, though, I think hitting the rbtree much at all 
> still implies that the caches might not be well-matched to the 
> workload's map/unmap pattern, and maybe scaling up the depot size could 
> still be the biggest win.
> 
> Thanks,
> Robin.
> 
> [1] 
> https://lore.kernel.org/linux-iommu/e9abc601b00e26fd15a583fcd55f2a8227903077.1674061620.git.robin.murphy@arm.com/
> 


* Re: AMD IOMMU problem after NIC uses multi-page allocation
  2023-03-30 13:10     ` Robin Murphy
@ 2023-03-31  4:06       ` Jakub Kicinski
  0 siblings, 0 replies; 7+ messages in thread
From: Jakub Kicinski @ 2023-03-31  4:06 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Joerg Roedel, Vasant Hegde, Suravee Suthikulpanit, iommu,
	netdev@vger.kernel.org, Willem de Bruijn, Saeed Mahameed

On Thu, 30 Mar 2023 14:10:09 +0100 Robin Murphy wrote:
> > There is that old issue already mentioned where there seems to be some 
> > interplay between the IOVA caching and the lazy flush queue, which we 
> > never really managed to get to the bottom of. IIRC my hunch was that 
> > with a sufficiently large number of CPUs, fq_flush_timeout() overwhelms 
> > the rcache depot and gets into a pathological state where it then 
> > continually thrashes the IOVA rbtree in a fight with the caching system.
> > 
> > Another (simpler) possibility which comes to mind is if the 9K MTU 
> > (which I guess means 16KB IOVA allocations) puts you up against the 
> > threshold of available 32-bit IOVA space - if you keep using the 16K 
> > entries then you'll mostly be recycling them out of the IOVA caches, 
> > which is nice and fast. However once you switch back to 1500 so needing 
> > 2KB IOVAs, you've now got a load of IOVA space hogged by all the 16KB 
> > entries that are now hanging around in caches, which could push you into 
> > the case where the optimistic 32-bit allocation starts to fail (but 
> > because it *can* fall back to a 64-bit allocation, it's not going to 
> > purge those unused 16KB entries to free up more 32-bit space). If the 
> > 32-bit space then *stays* full, alloc_iova should stay in fail-fast 
> > mode, but if some 2KB allocations were below 32 bits and eventually get 
> > freed back to the tree, then subsequent attempts are liable to spend 
> > ages doing their best to scrape up all the available 32-bit space 
> > until it's definitely full again. For that case, [1] should help.  
> 
> ...where by "2KB" I obviously mean 4KB, since apparently in remembering 
> that the caches round up to powers of two I managed to forget that 
> that's still in units of IOVA pages, derp.
> 
> Robin.
> 
> > 
> > Even in the second case, though, I think hitting the rbtree much at all 
> > still implies that the caches might not be well-matched to the 
> > workload's map/unmap pattern, and maybe scaling up the depot size could 
> > still be the biggest win.
> > 
> > Thanks,
> > Robin.
> > 
> > [1] 
> > https://lore.kernel.org/linux-iommu/e9abc601b00e26fd15a583fcd55f2a8227903077.1674061620.git.robin.murphy@arm.com/

Alright, can confirm! :) 
That patch on top of Linus's tree fixes the issue for me!

Noob question about large systems, if you indulge me: I ran into this
after enabling the IOMMU driver to get large (255+ thread) AMD machines
to work. Is there a general dependency on the IOMMU for such x86 systems,
or is the tie between the IOMMU and x2APIC AMD-specific? Or am I
completely confused?

I couldn't find anything in the kernel docs and I'm trying to wrap my
head around getting the kernel to work the same across a heterogeneous*
fleet of machines (* in terms of vendor and CPU count).


end of thread

Thread overview: 7+ messages
2023-03-30  1:14 AMD IOMMU problem after NIC uses multi-page allocation Jakub Kicinski
2023-03-30  2:36 ` Yunsheng Lin
2023-03-30  7:41 ` Joerg Roedel
2023-03-30 12:07   ` Vasant Hegde
2023-03-30 13:04   ` Robin Murphy
2023-03-30 13:10     ` Robin Murphy
2023-03-31  4:06       ` Jakub Kicinski
