* AMD IOMMU problem after NIC uses multi-page allocation
From: Jakub Kicinski @ 2023-03-30  1:14 UTC
To: Joerg Roedel, Suravee Suthikulpanit
Cc: iommu, netdev@vger.kernel.org, Willem de Bruijn, Saeed Mahameed

Hi Joerg, Suravee,

I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).

The NIC allocates a buffer for Rx packets which is MTU rounded up
to page size. If I run it with 1500B MTU or 9000 MTU everything is
fine, slight but manageable perf hit.

But if I flip the MTU to 9k, run some traffic and then go back to 1.5k
- 70%+ of CPU cycles are spent in alloc_iova (and children).

Does this ring any bells?
* Re: AMD IOMMU problem after NIC uses multi-page allocation
From: Yunsheng Lin @ 2023-03-30  2:36 UTC
To: Jakub Kicinski, Joerg Roedel, Suravee Suthikulpanit
Cc: iommu, netdev@vger.kernel.org, Willem de Bruijn, Saeed Mahameed

On 2023/3/30 9:14, Jakub Kicinski wrote:
> Hi Joerg, Suravee,
>
> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
>
> The NIC allocates a buffer for Rx packets which is MTU rounded up
> to page size. If I run it with 1500B MTU or 9000 MTU everything is
> fine, slight but manageable perf hit.
>
> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k
> - 70%+ of CPU cycles are spent in alloc_iova (and children).
>
> Does this ring any bells?

This rings a bell for me; not sure if it will help, but see:

https://lore.kernel.org/linux-iommu/20190815121104.29140-3-thunder.leizhen@huawei.com/
* Re: AMD IOMMU problem after NIC uses multi-page allocation
From: Joerg Roedel @ 2023-03-30  7:41 UTC
To: Jakub Kicinski, Vasant Hegde, Robin Murphy
Cc: Suravee Suthikulpanit, iommu, netdev@vger.kernel.org, Willem de Bruijn, Saeed Mahameed

Also adding Vasant and Robin.

Vasant, Robin, any idea?

On Wed, Mar 29, 2023 at 06:14:07PM -0700, Jakub Kicinski wrote:
> Hi Joerg, Suravee,
>
> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
>
> The NIC allocates a buffer for Rx packets which is MTU rounded up
> to page size. If I run it with 1500B MTU or 9000 MTU everything is
> fine, slight but manageable perf hit.
>
> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k
> - 70%+ of CPU cycles are spent in alloc_iova (and children).
>
> Does this ring any bells?
* Re: AMD IOMMU problem after NIC uses multi-page allocation
From: Vasant Hegde @ 2023-03-30 12:07 UTC
To: Joerg Roedel, Jakub Kicinski, Robin Murphy
Cc: Suravee Suthikulpanit, iommu, netdev@vger.kernel.org, Willem de Bruijn, Saeed Mahameed

Hi Jakub,

On 3/30/2023 1:11 PM, Joerg Roedel wrote:
> Also adding Vasant and Robin.
>
> Vasant, Robin, any idea?
>

I tried a few things on my Milan system, but I can't reproduce the issue.

Can you please try the patch mentioned by Yunsheng? If it's still an
issue, can you please provide steps to reproduce? I will take a look.

-Vasant

> On Wed, Mar 29, 2023 at 06:14:07PM -0700, Jakub Kicinski wrote:
>> Hi Joerg, Suravee,
>>
>> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
>>
>> The NIC allocates a buffer for Rx packets which is MTU rounded up
>> to page size. If I run it with 1500B MTU or 9000 MTU everything is
>> fine, slight but manageable perf hit.
>>
>> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k
>> - 70%+ of CPU cycles are spent in alloc_iova (and children).
>>
>> Does this ring any bells?
* Re: AMD IOMMU problem after NIC uses multi-page allocation
From: Robin Murphy @ 2023-03-30 13:04 UTC
To: Joerg Roedel, Jakub Kicinski, Vasant Hegde
Cc: Suravee Suthikulpanit, iommu, netdev@vger.kernel.org, Willem de Bruijn, Saeed Mahameed

On 2023-03-30 08:41, Joerg Roedel wrote:
> Also adding Vasant and Robin.
>
> Vasant, Robin, any idea?
>
> On Wed, Mar 29, 2023 at 06:14:07PM -0700, Jakub Kicinski wrote:
>> Hi Joerg, Suravee,
>>
>> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
>>
>> The NIC allocates a buffer for Rx packets which is MTU rounded up
>> to page size. If I run it with 1500B MTU or 9000 MTU everything is
>> fine, slight but manageable perf hit.
>>
>> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k
>> - 70%+ of CPU cycles are spent in alloc_iova (and children).
>>
>> Does this ring any bells?

There is that old issue already mentioned where there seems to be some
interplay between the IOVA caching and the lazy flush queue, which we
never really managed to get to the bottom of. IIRC my hunch was that
with a sufficiently large number of CPUs, fq_flush_timeout() overwhelms
the rcache depot and gets into a pathological state where it then
continually thrashes the IOVA rbtree in a fight with the caching system.

Another (simpler) possibility which comes to mind is if the 9K MTU
(which I guess means 16KB IOVA allocations) puts you up against the
threshold of available 32-bit IOVA space - if you keep using the 16K
entries then you'll mostly be recycling them out of the IOVA caches,
which is nice and fast. However once you switch back to 1500 so needing
2KB IOVAs, you've now got a load of IOVA space hogged by all the 16KB
entries that are now hanging around in caches, which could push you into
the case where the optimistic 32-bit allocation starts to fail (but
because it *can* fall back to a 64-bit allocation, it's not going to
purge those unused 16KB entries to free up more 32-bit space). If the
32-bit space then *stays* full, alloc_iova should stay in fail-fast
mode, but if some 2KB allocations were below 32 bits and eventually get
freed back to the tree, then subsequent attempts are liable to spend
ages doing their best to scrape up all the available 32-bit space
until it's definitely full again. For that case, [1] should help.

Even in the second case, though, I think hitting the rbtree much at all
still implies that the caches might not be well-matched to the
workload's map/unmap pattern, and maybe scaling up the depot size could
still be the biggest win.

Thanks,
Robin.

[1] https://lore.kernel.org/linux-iommu/e9abc601b00e26fd15a583fcd55f2a8227903077.1674061620.git.robin.murphy@arm.com/
* Re: AMD IOMMU problem after NIC uses multi-page allocation
From: Robin Murphy @ 2023-03-30 13:10 UTC
To: Joerg Roedel, Jakub Kicinski, Vasant Hegde
Cc: Suravee Suthikulpanit, iommu, netdev@vger.kernel.org, Willem de Bruijn, Saeed Mahameed

On 2023-03-30 14:04, Robin Murphy wrote:
> On 2023-03-30 08:41, Joerg Roedel wrote:
>> Also adding Vasant and Robin.
>>
>> Vasant, Robin, any idea?
>>
>> On Wed, Mar 29, 2023 at 06:14:07PM -0700, Jakub Kicinski wrote:
>>> Hi Joerg, Suravee,
>>>
>>> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
>>>
>>> The NIC allocates a buffer for Rx packets which is MTU rounded up
>>> to page size. If I run it with 1500B MTU or 9000 MTU everything is
>>> fine, slight but manageable perf hit.
>>>
>>> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k
>>> - 70%+ of CPU cycles are spent in alloc_iova (and children).
>>>
>>> Does this ring any bells?
>
> There is that old issue already mentioned where there seems to be some
> interplay between the IOVA caching and the lazy flush queue, which we
> never really managed to get to the bottom of. IIRC my hunch was that
> with a sufficiently large number of CPUs, fq_flush_timeout() overwhelms
> the rcache depot and gets into a pathological state where it then
> continually thrashes the IOVA rbtree in a fight with the caching system.
>
> Another (simpler) possibility which comes to mind is if the 9K MTU
> (which I guess means 16KB IOVA allocations) puts you up against the
> threshold of available 32-bit IOVA space - if you keep using the 16K
> entries then you'll mostly be recycling them out of the IOVA caches,
> which is nice and fast. However once you switch back to 1500 so needing
> 2KB IOVAs, you've now got a load of IOVA space hogged by all the 16KB
> entries that are now hanging around in caches, which could push you into
> the case where the optimistic 32-bit allocation starts to fail (but
> because it *can* fall back to a 64-bit allocation, it's not going to
> purge those unused 16KB entries to free up more 32-bit space). If the
> 32-bit space then *stays* full, alloc_iova should stay in fail-fast
> mode, but if some 2KB allocations were below 32 bits and eventually get
> freed back to the tree, then subsequent attempts are liable to spend
> ages doing their best to scrape up all the available 32-bit space
> until it's definitely full again. For that case, [1] should help.

...where by "2KB" I obviously mean 4KB, since apparently in remembering
that the caches round up to powers of two I managed to forget that
that's still in units of IOVA pages, derp.

Robin.

>
> Even in the second case, though, I think hitting the rbtree much at all
> still implies that the caches might not be well-matched to the
> workload's map/unmap pattern, and maybe scaling up the depot size could
> still be the biggest win.
>
> Thanks,
> Robin.
>
> [1] https://lore.kernel.org/linux-iommu/e9abc601b00e26fd15a583fcd55f2a8227903077.1674061620.git.robin.murphy@arm.com/
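[Editorial note: to make the size arithmetic above concrete, here is a
minimal standalone sketch in plain userspace C - it is not kernel code,
and the PAGE_SIZE constant and round_up_pow_of_two() helper are
illustrative assumptions, not the kernel's own. It shows the two rounding
steps Robin describes: the Rx buffer is the MTU rounded up to whole
pages, and the IOVA rcache then rounds the page count up to a power of
two when choosing a cache bucket.]

/*
 * Illustrative sketch of the rounding discussed above, assuming a 4KB
 * IOVA granule. Not kernel code; names and constants are hypothetical.
 */
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Round n up to the next power of two (illustrative helper). */
static unsigned long round_up_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	unsigned long mtus[] = { 1500UL, 9000UL };

	for (int i = 0; i < 2; i++) {
		/* Rx buffer: MTU rounded up to whole pages. */
		unsigned long pages = (mtus[i] + PAGE_SIZE - 1) / PAGE_SIZE;
		/* IOVA cache bucket: page count rounded up to a power of two. */
		unsigned long bucket_pages = round_up_pow_of_two(pages);

		printf("MTU %4lu -> %lu page(s) mapped -> %2luKB IOVA cache bucket\n",
		       mtus[i], pages, bucket_pages * PAGE_SIZE / 1024);
	}
	return 0;
}

[Compiled with any C compiler, this prints a 4KB bucket for a 1500B MTU
and a 16KB bucket for 9000B - the two sizes Robin refers to - which is
why flipping between the MTUs leaves the caches holding entries of the
"wrong" size.]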
* Re: AMD IOMMU problem after NIC uses multi-page allocation
From: Jakub Kicinski @ 2023-03-31  4:06 UTC
To: Robin Murphy
Cc: Joerg Roedel, Vasant Hegde, Suravee Suthikulpanit, iommu, netdev@vger.kernel.org, Willem de Bruijn, Saeed Mahameed

On Thu, 30 Mar 2023 14:10:09 +0100 Robin Murphy wrote:
> > There is that old issue already mentioned where there seems to be some
> > interplay between the IOVA caching and the lazy flush queue, which we
> > never really managed to get to the bottom of. IIRC my hunch was that
> > with a sufficiently large number of CPUs, fq_flush_timeout() overwhelms
> > the rcache depot and gets into a pathological state where it then
> > continually thrashes the IOVA rbtree in a fight with the caching system.
> >
> > Another (simpler) possibility which comes to mind is if the 9K MTU
> > (which I guess means 16KB IOVA allocations) puts you up against the
> > threshold of available 32-bit IOVA space - if you keep using the 16K
> > entries then you'll mostly be recycling them out of the IOVA caches,
> > which is nice and fast. However once you switch back to 1500 so needing
> > 2KB IOVAs, you've now got a load of IOVA space hogged by all the 16KB
> > entries that are now hanging around in caches, which could push you into
> > the case where the optimistic 32-bit allocation starts to fail (but
> > because it *can* fall back to a 64-bit allocation, it's not going to
> > purge those unused 16KB entries to free up more 32-bit space). If the
> > 32-bit space then *stays* full, alloc_iova should stay in fail-fast
> > mode, but if some 2KB allocations were below 32 bits and eventually get
> > freed back to the tree, then subsequent attempts are liable to spend
> > ages doing their best to scrape up all the available 32-bit space
> > until it's definitely full again. For that case, [1] should help.
>
> ...where by "2KB" I obviously mean 4KB, since apparently in remembering
> that the caches round up to powers of two I managed to forget that
> that's still in units of IOVA pages, derp.
>
> Robin.
>
> >
> > Even in the second case, though, I think hitting the rbtree much at all
> > still implies that the caches might not be well-matched to the
> > workload's map/unmap pattern, and maybe scaling up the depot size could
> > still be the biggest win.
> >
> > Thanks,
> > Robin.
> >
> > [1] https://lore.kernel.org/linux-iommu/e9abc601b00e26fd15a583fcd55f2a8227903077.1674061620.git.robin.murphy@arm.com/

Alright, can confirm! :) That patch on top of Linus's tree fixes the
issue for me!

Noob question about large systems, if you indulge me - I ran into this
after enabling the IOMMU driver to get large (255+ thread) AMD machines
to work. Is there a general dependency on the IOMMU for such x86
systems, or is the tie between the IOMMU and x2APIC AMD-specific? Or am
I completely confused? I couldn't find anything in the kernel docs and
I'm trying to wrap my head around getting the kernel to work the same
across a heterogeneous* fleet of machines (* in terms of vendor and CPU
count).